
pdf - Elasticsearch Parse Exception error when attempting to index a PDF

Reposted. Author: 行者123. Updated: 2023-11-29 02:44:19

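I'm just getting started with elasticsearch. Our requirement is to index many thousands of PDF files, and I'm having a hard time getting even one of them to index successfully.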

I installed the Attachment Type plugin and got the response: Installed mapper-attachments

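I followed the Attachment Type in Action tutorial, but the process hangs and I don't know how to interpret the error message. I also tried the gist, which gets stuck in the same place.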

$ curl -X POST "localhost:9200/test/attachment/" -d json.file 
{"error":"ElasticSearchParseException[Failed to derive xcontent from (offset=0, length=9): [106, 115, 111, 110, 46, 102, 105, 108, 101]]","status":400}

More details:

json.file contains an embedded Base64 PDF file (as per the instructions). The first line of the file looks correct (to me, anyway): {"file":"JVBERi0xLjQNJeLjz9MNCjE1OCAwIG9iaiA8...

I'm not sure whether json.file is invalid, or whether elasticsearch just isn't set up to parse PDFs properly.

Encoding - here is how we encode the PDF into json.file (as per the tutorial):

coded=`cat fn6742.pdf | perl -MMIME::Base64 -ne 'print encode_base64($_)'`
json="{\"file\":\"${coded}\"}"
echo "$json" > json.file

Also tried:

coded=`openssl base64 -in fn6742.pdf`

Log:

[2012-06-07 12:32:16,742][DEBUG][action.index             ] [Bailey, Paul] [test][0], node[AHLHFKBWSsuPnTIRVhNcuw], [P], s[STARTED]: Failed to execute [index {[test][attachment][DauMB-vtTIaYGyKD4P8Y_w], source[json.file]}]
org.elasticsearch.ElasticSearchParseException: Failed to derive xcontent from (offset=0, length=9): [106, 115, 111, 110, 46, 102, 105, 108, 101]
at org.elasticsearch.common.xcontent.XContentFactory.xContent(XContentFactory.java:147)
at org.elasticsearch.common.xcontent.XContentHelper.createParser(XContentHelper.java:50)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:451)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:437)
at org.elasticsearch.index.shard.service.InternalIndexShard.prepareCreate(InternalIndexShard.java:290)
at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:210)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:532)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:430)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:680)

I'm hoping someone can help me see what I'm missing or doing wrong.

Best answer

The following error points to the root of the problem.

Failed to derive xcontent from (offset=0, length=9): [106, 115, 111, 110, 46, 102, 105, 108, 101]

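The UTF-8 codes [106, 115, 111, ...] show that you are trying to index the literal string "json.file" (9 bytes, matching length=9) rather than the contents of the file.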
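To check this reading of the error, the byte values can be decoded back into text. The snippet below is a minimal sketch, not part of the original answer; the variable names `codes` and `decoded` are made up for illustration, and it assumes a POSIX-style shell such as sh or bash.

```shell
# Byte values copied from the error message.
codes="106 115 111 110 46 102 105 108 101"

# Turn each decimal value into an octal escape (\152\163...),
# then let the outer printf expand those escapes into characters.
decoded=$(printf "$(printf '\\%03o' $codes)")

echo "$decoded"      # json.file
echo "${#decoded}"   # 9, the same as length=9 in the error
```

The decoded text is the file name itself, so the request body that reached the server was the name typed on the command line, not the JSON document stored in the file.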

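To index the contents of the file, simply add the character "@" in front of the file name. With that prefix, curl reads the request body from the file instead of sending the argument as literal text.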

curl -X POST "localhost:9200/test/attachment/" -d @json.file

Regarding pdf - Elasticsearch Parse Exception error when attempting to index a PDF, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/11017543/
