
elasticsearch - How do I index a pdf file using the Elasticsearch ingest attachment plugin?

Reposted. Author: 行者123. Updated: 2023-11-29 02:52:41

I have to implement full-text search in pdf documents using the Elasticsearch ingest plugin. When I try to search for the word someword in a pdf document, I get an empty hits array.

//Code for creating pipeline

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
      }
    }
  ]
}

//Code for indexing the document

PUT my_index/my_type/my_id?pipeline=attachment
{
  "filename" : "C:\\Users\\myname\\Desktop\\bh1.pdf",
  "title" : "Quick",
  "data" : "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
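The `data` field has to carry the file's bytes as Base64. A minimal sketch of building that request body in Python follows. The helper name `build_attachment_doc` is hypothetical, and the field names simply mirror the request above:

```python
import base64
from pathlib import Path


def build_attachment_doc(path, title):
    """Return the JSON-serializable body for the indexing request above.

    Hypothetical helper: it only mirrors the three fields used in the question.
    """
    raw = Path(path).read_bytes()
    return {
        "filename": str(path),
        "title": title,
        # The attachment processor reads Base64-encoded file content from "data".
        "data": base64.b64encode(raw).decode("ascii"),
    }
```

Calling `build_attachment_doc("bh1.pdf", "Quick")` and serializing the result with `json.dumps` should give a body of the same shape as the one in the question, with `data` taken from the real file.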

//Code for searching the word in pdf

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "data" : {
        "query" : "someword"
      }
    }
  }
}

Best answer

When you index your document with the second command, passing the Base64-encoded content, the document ends up looking like this:

{
  "filename": "C:\\Users\\myname\\Desktop\\bh1.pdf",
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
  "attachment": {
    "content_type": "application/rtf",
    "language": "ro",
    "content": "Lorem ipsum dolor sit amet",
    "content_length": 28
  },
  "title": "Quick"
}

So your query needs to look at the attachment.content field rather than the data field (data is only used to send the raw content at indexing time).
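To see what the attachment processor had to work with, you can decode the `data` value locally. A quick check in Python (standard library only) suggests that the sample payload is a tiny RTF document rather than a PDF, and that the word someword does not appear in it at all, so a search for it would come back empty whichever field is queried:

```python
import base64

# The Base64 string from the indexing request in the question.
data = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
decoded = base64.b64decode(data).decode("ascii")

print(repr(decoded))           # an RTF wrapper around "Lorem ipsum dolor sit amet"
print("someword" in decoded)   # False
print("Lorem" in decoded)      # True
```

This matches the `content_type` of `application/rtf` and the extracted `content` shown in the indexed document.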

Change your query to the following and it will work:

POST /my_index/my_type/_search
{
  "query": {
    "match": {
      "attachment.content": {
        "query": "lorem"
      }
    }
  }
}

PS: Use POST rather than GET when sending a payload.
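For readers calling the API from a script rather than a console, here is a rough sketch of the same search sent as a POST with Python's standard library. The base URL `http://localhost:9200` and the helper name `build_search_request` are assumptions for illustration, while the index, type, and field names come from the question:

```python
import json
import urllib.request


def build_search_request(term, base_url="http://localhost:9200"):
    """Build a POST search request that matches on attachment.content.

    base_url is an assumption; point it at your own cluster.
    """
    body = {"query": {"match": {"attachment.content": {"query": term}}}}
    return urllib.request.Request(
        url=f"{base_url}/my_index/my_type/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_search_request("lorem")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it. The sketch stops short of that so it can be read and run without a cluster.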

Regarding "elasticsearch - How do I index a pdf file using the Elasticsearch ingest attachment plugin?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42109961/
