
configuration - Apache Tika App configuration file

Reposted. Author: 行者123. Updated: 2023-12-04 12:43:00

I use the Apache Tika App as a command-line tool on my Ubuntu 16.04 server to extract the content of documents.

The [Apache Tika website][1] states:

Build artifacts

The Tika build consists of a number of components and produces the following main binaries:

tika-core/target/tika-core-*.jar Tika core library. Contains the core interfaces and classes of Tika, but none of the parser implementations. Depends only on Java 6.

tika-parsers/target/tika-parsers-*.jar Tika parsers. Collection of classes that implement the Tika Parser interface based on various external parser libraries.

tika-app/target/tika-app-*.jar Tika application. Combines the above components and all the external parser libraries into a single runnable jar with a GUI and a command line interface.



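So I downloaded the latest version (1.18) of tika-app-*.jar. It is a single file.

Running it from the command line, e.g. java -jar tika-app-1.18.jar -t <filename>, gives me the file content I need, but every run also prints two warnings: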

July 28, 2018 3:29:27 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed. See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.

July 28, 2018 3:29:27 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem WARNING: org.xerial's sqlite-jdbc is not loaded. Please provide the jar on your classpath to parse sqlite files. See tika-parsers/pom.xml for the correct version.



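I don't know whether these warnings slow anything down, but they repeat on every run and make it hard to follow the rest of the output.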
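As a side note, the warnings can be kept apart from the extracted text without changing Tika's configuration. The sketch below assumes the warnings are written to stderr (the default destination for java.util.logging's console handler, whose format the messages above match); that has not been confirmed for this setup. The function name `tika_text` and the variables `TIKA_JAR` and `TIKA_LOG` are made up for illustration.

```shell
# Hypothetical wrapper: stdout carries the extracted text,
# stderr (assumed to hold the warnings) is appended to a log file.
TIKA_JAR="${TIKA_JAR:-tika-app-1.18.jar}"

tika_text() {
    # $1 is the document to extract; TIKA_LOG defaults to ./tika-warnings.log
    java -jar "$TIKA_JAR" -t "$1" 2>>"${TIKA_LOG:-tika-warnings.log}"
}
```

With this in place, something like `tika_text report.doc > report.txt` would leave only the document text in report.txt. This hides the warnings rather than preventing them, so it does not address any parser-loading cost.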

I tried to point Tika at my own configuration file like this:
java -jar tika-app-1.18.jar --config=tika-config.xml -t <filename>
My tika-config.xml file is:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<parsers>
<parser class="org.apache.tika.parser.DefaultParser">
<mime-exclude>image/jpeg</mime-exclude>
<mime-exclude>application/x-sqlite3</mime-exclude>
<parser-exclude class="org.apache.tika.parser.jdbc.SQLite3Parser"/>
</parser>
</parsers>
</properties>

With that configuration I get No protocol: filename.doc, and the warnings are still there.

How can I exclude the JPEG and SQLite parsers?

Best answer

My solution is this tika-config.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
<service-loader loadErrorHandler="IGNORE"/>
<service-loader initializableProblemHandler="ignore"/>
<parsers>
<parser class="org.apache.tika.parser.DefaultParser">
<mime-exclude>image/jpeg</mime-exclude>
<mime-exclude>application/x-sqlite3</mime-exclude>
<parser-exclude class="org.apache.tika.parser.jdbc.SQLite3Parser"/>
</parser>
</parsers>
</properties>

Then I set:
export TIKA_CONFIG=/path/to/tika-config.xml

in my .bashrc file.
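A minimal sketch of adding that line to a shell startup file without duplicating it on repeated runs. The helper name `persist_tika_config` is hypothetical, and the config path is whatever location you chose for tika-config.xml.

```shell
# Hypothetical helper: append the TIKA_CONFIG export to a startup file
# only if the exact line is not already there.
persist_tika_config() {
    config_path="$1"                 # location of your tika-config.xml
    rc_file="${2:-$HOME/.bashrc}"    # startup file to edit, ~/.bashrc by default
    line="export TIKA_CONFIG=$config_path"
    # -x matches whole lines, -F treats the line as a fixed string
    grep -qxF "$line" "$rc_file" 2>/dev/null || printf '%s\n' "$line" >> "$rc_file"
}
```

Calling `persist_tika_config /path/to/tika-config.xml` and then opening a new shell (or running `source ~/.bashrc`) makes the variable available. For a one-off run, the variable can instead be set inline: `TIKA_CONFIG=/path/to/tika-config.xml java -jar tika-app-1.18.jar -t <filename>`.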

Regarding configuration - Apache Tika App configuration file, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51572684/
