gpt4 book ai didi

java - Apache TikaParser 抛出无法捕获的异常

转载 作者:行者123 更新时间:2023-12-01 18:03:05 26 4
gpt4 key购买 nike

我目前正在尝试开发一个使用 Apache TikaParser 从不同文件中提取内容的工具。在大多数情况下,一切正常,但在某些文件中,Tika 抛出以下异常:

Mar 09, 2020 11:21:58 AM org.apache.poi.ss.format.CellFormat <init>
WARNING: Invalid format: "_([$€-2]\ * "-"_);"
java.lang.IllegalArgumentException: Unsupported [] format block '[' in '_([$€-2]\ * "-"_)' with c2: null
at org.apache.poi.ss.format.CellFormatPart.formatType(CellFormatPart.java:373)
at org.apache.poi.ss.format.CellFormatPart.getCellFormatType(CellFormatPart.java:287)
at org.apache.poi.ss.format.CellFormatPart.<init>(CellFormatPart.java:191)
at org.apache.poi.ss.format.CellFormat.<init>(CellFormat.java:193)
at org.apache.poi.ss.format.CellFormat.getInstance(CellFormat.java:167)
at org.apache.poi.ss.usermodel.DataFormatter.getFormat(DataFormatter.java:343)
at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:901)
at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:873)
at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.formatNumberDateCell(FormatTrackingHSSFListener.java:143)
at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.formatNumberDateCell(ExcelExtractor.java:673)
at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.internalProcessRecord(ExcelExtractor.java:447)
at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processRecord(ExcelExtractor.java:340)
at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.processRecord(FormatTrackingHSSFListener.java:92)
at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.processRecord(ExcelExtractor.java:666)
at org.apache.poi.hssf.eventusermodel.HSSFRequest.processRecord(HSSFRequest.java:109)
at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:178)
at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:135)
at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:316)
at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:169)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:183)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at attproc.processors.AttachmentProcessor.run(AttachmentProcessor.java:68)
at attproc.Main.lambda$main$0(Main.java:89)
at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)

我尝试使用以下代码捕获此异常:

 try {
byte[] content = Files.readAllBytes(path);
try {
Metadata metadata = new Metadata();
BodyContentHandler handler = new BodyContentHandler(-1);
ParseContext parseContext = new ParseContext();
parseContext.set(PDFParserConfig.class, tikaConfig.pdfConfig);

try {
tikaConfig.autoDetectParser.parse(new ByteArrayInputStream(content), handler, metadata, parseContext);
text = Optional.ofNullable(handler.toString()).orElse("");
} catch (Exception ignored) {}

} catch (Exception ignored) {
}

} catch (IOException ignored) {
}

“tikaConfig”是一个单例对象:

public class TikaConfiguration {
private final TikaConfig tikaConfig;
public final PDFParserConfig pdfConfig;
public final Parser autoDetectParser;

private static TikaConfiguration instance;

private TikaConfiguration() throws Exception {
ClassLoader classLoader = getClass().getClassLoader();
InputStream stream = classLoader.getResourceAsStream("tikaconfig.xml");
this.tikaConfig = new TikaConfig(stream);
this.pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(false);

tikaConfig.getDetector();
autoDetectParser = new AutoDetectParser(tikaConfig);
}

public static TikaConfiguration setConfiguration() {
if (TikaConfiguration.instance == null) {
try {
TikaConfiguration.instance = new TikaConfiguration();
} catch (Exception ignored) {}
}

return TikaConfiguration.instance;
}
}

我需要做什么才能捕获这个异常?

最佳答案

看看this有点旧的线程。您所看到的看起来非常相似。它表明 Tika 用于解析 Excel 的 POI 库抛出了警告,而不是错误(您的日志输出也反射(reflect)了这一点)。该警告恰好在其日志记录中包含堆栈跟踪(我认为是由 POI 捕获的,然后传递给 Tika)。

因此,您的代码不会捕获该警告(这不是抛出的异常)。

正如一位评论者在 JIRA 中提到的那样:

I'm not sure this is even a bug. This is the output of the POILogger, not, e.g. printStackTrace().

无论其状态如何,JIRA 中也提出了一种解决方法:运行应用程序时,将 err 流重定向到 null(提供了示例)。

我下载了 JIRA 附带的电子表格,并且能够重新创建您的消息的版本:

WARNING: Invalid format: "_([$Ç-2]\ * #,##0.00_);"
java.lang.IllegalArgumentException: Unsupported [] format block '[' in '_([$Ç-2]\ * #,##0.00_)' with c2: null
at org.apache.poi.ss.format.CellFormatPart.formatType(CellFormatPart.java:373)
at org.apache.poi.ss.format.CellFormatPart.getCellFormatType(CellFormatPart.java:287)
at org.apache.poi.ss.format.CellFormatPart.<init>(CellFormatPart.java:191)
at org.apache.poi.ss.format.CellFormat.<init>(CellFormat.java:193)
...

但是,我的程序成功完成了。它继续正确生成输出。

关于java - Apache TikaParser 抛出无法捕获的异常,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/60599097/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com