MSIPO技术圈 首页 IT技术 查看内容

SpringBoot+ElasticSearch实现文档内容抽取、高亮分词、全文检索

2024-03-27

需求
产品希望我们这边能够实现用户上传PDF、WORD、TXT之内得文本内容,然后用户可以根据附件名称或文件内容模糊查询文件信息,并可以在线查看文件内容。

一、环境
项目开发环境:

后台管理系统springboot+mybatis_plus+mysql+es
搜索引擎:elasticsearch7.9.3 +kibana图形化界面

二、功能实现
1.搭建环境
es+kibana的搭建这里就不介绍了,网上多的是

后台程序搭建也不介绍,这里有一点很重要,Java使用的连接es的包的版本一定要和es的版本对应上,不然你会有各种问题

2.文件内容识别
第一步: 要用es实现文本附件内容的识别,需要先给es安装一个插件:Ingest Attachment Processor Plugin

这知识一个内容识别的插件,还有其它的例如OCR之类的其它插件,有兴趣的可以去搜一下了解一下

Ingest Attachment Processor Plugin是一个文本抽取插件,本质上是利用了Elasticsearch的ingest node功能,提供了关键的预处理器attachment。在安装目录下运行以下命令即可安装。

到es的安装文件bin目录下执行

elasticsearch-plugin install ingest-attachment

因为我们这里es是使用docker安装的,所以需要进入到es的docker镜像里面的bin目录下安装插件

[root@iZuf63d0pqnjrga4pi18udZ plugins]# docker exec -it es bash
[root@elasticsearch elasticsearch]# ls
LICENSE.txt  NOTICE.txt  README.asciidoc  bin  config  data  jdk  lib  logs  modules  plugins
[root@elasticsearch elasticsearch]# cd bin/
[root@elasticsearch bin]# ls
elasticsearch          elasticsearch-certutil  elasticsearch-croneval  elasticsearch-env-from-file  elasticsearch-migrate  elasticsearch-plugin         elasticsearch-setup-passwords  elasticsearch-sql-cli            elasticsearch-syskeygen  x-pack-env           x-pack-watcher-env
elasticsearch-certgen  elasticsearch-cli       elasticsearch-env       elasticsearch-keystore       elasticsearch-node     elasticsearch-saml-metadata  elasticsearch-shard            elasticsearch-sql-cli-7.9.3.jar  elasticsearch-users      x-pack-security-env
[root@elasticsearch bin]# elasticsearch-plugin install ingest-attachment
-> Installing ingest-attachment
-> Downloading ingest-attachment from elastic
[=================================================] 100%?? 
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@     WARNING: plugin requires additional permissions     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
* java.lang.RuntimePermission accessClassInPackage.sun.java2d.cmm.kcms
* java.lang.RuntimePermission accessDeclaredMembers
* java.lang.RuntimePermission getClassLoader
* java.lang.reflect.ReflectPermission suppressAccessChecks
* java.security.SecurityPermission createAccessControlContext
* java.security.SecurityPermission insertProvider
* java.security.SecurityPermission putProviderProperty.BC
See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html
for descriptions of what these permissions allow and the associated risks.
 
Continue with installation? [y/N]y
-> Installed ingest-attachment

显示installed 就表示安装完成了,然后重启es,不然第二步要报错

第二步:创建一个文本抽取的管道

主要是用于将上传的附件转换成文本内容,支持(word,PDF,txt,excel没试,应该也支持)
在这里插入图片描述

{
    "description": "Extract attachment information",
    "processors": [
        {
            "attachment": {
                "field": "content",
                "ignore_missing": true
            }
        },
        {
            "remove": {
                "field": "content"
            }
        }
    ]
}

第三步:定义我们内容存储的索引
在这里插入图片描述

{
  "mappings": {
    "properties": {
      "id":{
        "type": "keyword"
      },
      "fileName":{
        "type": "text",
        "analyzer": "my_ana"
      },
      "contentType":{
        "type": "text",
         "analyzer": "my_ana"
      },
       "fileUrl":{
        "type": "text"
      },
      "attachment": {
        "properties": {
          "content":{
            "type": "text",
            "analyzer": "my_ana"
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "jieba_stop": {
          "type":        "stop",
          "stopwords_path": "stopword/stopwords.txt"
        },
        "jieba_synonym": {
          "type":        "synonym",
          "synonyms_path": "synonym/synonyms.txt"
        }
      },
      "analyzer": {
        "my_ana": {
          "tokenizer": "jieba_index",
          "filter": [
            "lowercase",
            "jieba_stop",
            "jieba_synonym"
          ]
        }
      }
    }
  }
}

mapping:定义的是存储的字段格式
setting:索引的配置信息,这边定义了一个分词(使用的是jieba的分词)

注意:内容检索的是attachment.content字段,一定要使用分词,不使用分词的话,检索会检索不出来内容

第四步:测试

在这里插入图片描述

{
    "id":"1",
 "name":"进口红酒",
 "filetype":"pdf",
    "contenttype":"文章",
 "content":"文章内容"
}

测试内容需要将附件转换成base64格式

在线转换文件的地址:https://www.zhangxinxu.com/sp/base64.html

查询刚刚上传的文件:
在这里插入图片描述

{
    "took": 861,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 5,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "fileinfo",
                "_type": "_doc",
                "_id": "lkPEgYIBz3NlBKQzXYX9",
                "_score": 1.0,
                "_source": {
                    "fileName": "测试_20220809164145A002.docx",
                    "updateTime": 1660034506000,
                    "attachment": {
                        "date": "2022-08-09T01:38:00Z",
                        "content_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
                        "author": "DELL",
                        "language": "lt",
                        "content": "内容",
                        "content_length": 2572
                    },
                    "createTime": 1660034506000,
                    "fileUrl": "http://localhost:8092/fileInfo/profile/upload/fileInfo/2022/08/09/测试_20220809164145A002.docx",
                    "id": 1306333192,
                    "contentType": "文章",
                    "fileType": "docx"
                }
            },
            {
                "_index": "fileinfo",
                "_type": "_doc",
                "_id": "mUPHgYIBz3NlBKQzwIVW",
                "_score": 1.0,
                "_source": {
                    "fileName": "测试_20220809164527A001.docx",
                    "updateTime": 1660034728000,
                    "attachment": {
                        "date": "2022-08-09T01:38:00Z",
                        "content_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
                        "author": "DELL",
                        "language": "lt",
                        "content": "内容",
                        "content_length": 2572
                    },
                    "createTime": 1660034728000,
                    "fileUrl": "http://localhost:8092/fileInfo/profile/upload/fileInfo/2022/08/09/测试_20220809164527A001.docx",
                    "id": 1306333193,
                    "contentType": "文章",
                    "fileType": "docx"
                }
            },
            {
                "_index": "fileinfo",
                "_type": "_doc",
                "_id": "JDqshoIBbkTNu1UgkzFK",
                "_score": 1.0,
                "_source": {
                    "fileName": "txt测试_20220810153351A001.txt",
                    "updateTime": 1660116831000,
                    "attachment": {
                        "content_type": "text/plain; charset=UTF-8",
                        "language": "lt",
                        "content": "内容",
                        "content_length": 804
                    },
                    "createTime": 1660116831000,
                    "fileUrl": "http://localhost:8092/fileInfo/profile/upload/fileInfo/2022/08/10/txt测试_20220810153351A001.txt",
                    "id": 1306333194,
                    "contentType": "告示",
                    "fileType": "txt"
                }
            }
        ]
    }
}

我们调用上传的接口,可以看到文本内容已经抽取到es里面了,后面就可以直接分词检索内容,高亮显示了

三.代码
介绍下代码实现逻辑:文件上传,数据库存储附件信息和附件上传地址;调用es实现文本内容抽取,将抽取的内容放到对应索引下;提供小程序全文检索的api实现根据文件名称关键词联想,文件名称内容全文检索模糊匹配,并高亮显示分词匹配字段;直接贴代码

yml配置文件:

# 数据源配置
spring:
    # 服务模块
    devtools:
        restart:
            # 热部署开关
            enabled: true
    # 搜索引擎
    elasticsearch:
        rest:
            url: 127.0.0.1
            uris: 127.0.0.1:9200
            connection-timeout: 1000
            read-timeout: 3000
            username: elastic
            password: 123456

elsticsearchConfig(连接配置)

package com.yj.rselasticsearch.domain.config;
 
import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
 
import java.time.Duration;
 
@Configuration
public class ElasticsearchConfig {
    @Value("${spring.elasticsearch.rest.url}")
    private String edUrl;
    @Value("${spring.elasticsearch.rest.username}")
    private String userName;
    @Value("${spring.elasticsearch.rest.password}")
    private String password;
 
    @Bean
    public RestHighLevelClient restHighLevelClient() {
        //设置连接的用户名密码
        final BasicCredentialsProvider credentialsProvider = new BasicCredentialsProvider();
        credentialsProvider.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials(userName, password));
        RestHighLevelClient client =  new RestHighLevelClient(RestClient.builder(
                        new HttpHost(edUrl, 9200,"http"))
                .setHttpClientConfigCallback(httpClientBuilder -> {
                    httpClientBuilder.disableAuthCaching();
                    //保持连接池处于链接状态,该bug曾导致es一段时间没使用,第一次连接访问超时
                    httpClientBuilder.setKeepAliveStrategy(((response, context) -> Duration.ofMinutes(5).toMillis()));
                    return httpClientBuilder.setDefaultCredentialsProvider(credentialsProvider);
                })
        );
        return client;
    }
}

文件上传保存文件信息并抽取内容到es

实体对象FileInfo

package com.yj.common.core.domain.entity;
 
import com.baomidou.mybatisplus.annotation.TableField;
import com.yj.common.core.domain.BaseEntity;
import lombok.Data;
import lombok.EqualsAndHashCode;
import lombok.Getter;
import lombok.Setter;
import org.springframework.data.elasticsearch.annotations.Document;
import org.springframework.data.elasticsearch.annotations.Field;
import org.springframework.data.elasticsearch.annotations.FieldType;
 
import java.util.Date;
 
@Setter
@Getter
@Document(indexName = "fileinfo",createIndex = false)
public class FileInfo {
    /**
    * 主键
    */
    @Field(name = "id", type = FieldType.Integer)
    private Integer id;
 
    /**
    * 文件名称
    */
    @Field(name = "fileName", type = FieldType.Text,analyzer = "jieba_index",searchAnalyzer = "jieba_index")
    private String fileName;
 
    /**
    * 文件类型
    */
    @Field(name = "fileType",  type = FieldType.Keyword)
    private String fileType;
 
    /**
    * 内容类型
    */
    @Field(name = "contentType", type = FieldType.Text)
    private String contentType;
 
    /**
     * 附件内容
     */
    @Field(name = "attachment.content", type = FieldType.Text,analyzer = "jieba_index",searchAnalyzer = "jieba_index")
    @TableField(exist = false)
    private String content;
 
    /**
    * 文件地址
    */
    @Field(name = "fileUrl", type = FieldType.Text)
    private String fileUrl;
 
    /**
     * 创建时间
     */
    private Date createTime;
 
    /**
     * 更新时间
     */
    private Date updateTime;
}

controller类

package com.yj.rselasticsearch.controller;
 
import com.yj.common.core.controller.BaseController;
import com.yj.common.core.domain.AjaxResult;
import com.yj.common.core.domain.entity.FileInfo;
import com.yj.rselasticsearch.service.FileInfoService;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;
 
import javax.annotation.Resource;
 
/**
 * (file_info)表控制层
 *
 * @author xxxxx
 */
@RestController
@RequestMapping("/fileInfo")
public class FileInfoController extends BaseController {
    /**
     * 服务对象
     */
    @Resource
    private FileInfoService fileInfoService;
 
 
    @PutMapping("uploadFile")
    public AjaxResult uploadFile(String contentType, MultipartFile file) {
        return fileInfoService.uploadFileInfo(contentType,file);
    }
}

serviceImpl实现类

package com.yj.rselasticsearch.service.impl;
 
import com.alibaba.fastjson.JSON;
import com.baomidou.mybatisplus.core.conditions.query.LambdaQueryWrapper;
import com.yj.common.config.RuoYiConfig;
import com.yj.common.core.domain.AjaxResult;
import com.yj.common.utils.FastUtils;
import com.yj.common.utils.StringUtils<

相关阅读

热门文章

    手机版|MSIPO技术圈 皖ICP备19022944号-2

    Copyright © 2024, msipo.com

    返回顶部