
java PdfTextExtractor.getTextFromPage (Unknown Source)

Reposted. Author: 行者123. Updated: 2023-12-02 08:36:03

Hello, I have a problem parsing a PDF: an exception is thrown when the iteration reaches page 11.

Any ideas? Thanks.

Here is my code:

import java.io.*;
import java.nio.charset.Charset;
import java.util.regex.*;
import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.hyphenation.TernaryTree.Iterator;
import com.lowagie.text.pdf.parser.PdfTextExtractor;

public class PdfParser {
    /**
     * @param args
     */
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        int index = 0;
        try {
            PdfReader readerN = new PdfReader("C:\\Documents and Settings\\stefan.stere\\hibernateWorkspace\\PdfParser\\src\\monitor3.pdf");
            OutputStreamWriter out = new OutputStreamWriter( new FileOutputStream(new File("C:\\Documents and Settings\\stefan.stere\\hibernateWorkspace\\PdfParser\\src\\pdf2txt.rtf")),"Cp1252");

            PdfTextExtractor parse = new PdfTextExtractor(readerN);
            int nrPages = readerN.getNumberOfPages();

            for (int i=1; i<nrPages ; i++) {
                index++;
                String page = parse.getTextFromPage(i);
                if(page != null){
                    page = page.replace(new StringBuffer("null"), new StringBuffer("??"));
                    page = page.replaceAll("Comercial.", "Comerciala");
                    page = page.replaceAll("ACT ADI..IONAL", "ACT ADITIONAL");
                    page = page.replaceAll("HOT.R..E", "HOTARARE");
                    page = page.replaceAll("HOT.R..EA", "HOTARAREA");
                    page = page.replaceAll("HOT.R..I", "HOTARARI");
                    page = page.replaceAll("..cheiat.", "incheiata");
                    page = page.replaceAll("ANUN..", "ANUNT");
                    out.write(page);
                    System.out.println(page);
                }
            }
            out.close();
            readerN.close();
        } catch (Exception e) {
            // TODO: handle exception
            e.printStackTrace();
            System.out.println(index);
        }
    }
}

And the exception stack trace:

java.lang.ArrayIndexOutOfBoundsException: Invalid index: 62
at com.lowagie.text.pdf.CMapAwareDocumentFont.decodeSingleCID(Unknown Source)
at com.lowagie.text.pdf.CMapAwareDocumentFont.decode(Unknown Source)
at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.decode(Unknown Source)
at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.displayPdfString(Unknown Source)
at com.lowagie.text.pdf.parser.PdfContentStreamProcessor$ShowTextArray.invoke(Unknown Source)
at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(Unknown Source)
at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.processContent(Unknown Source)
at com.lowagie.text.pdf.parser.PdfTextExtractor.getTextFromPage(Unknown Source)
at PdfParser.main(PdfParser.java:32)

Best answer

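This is not an answer, but it seems many people have the same problem, and there is another related question on SO. If you search Google for ArrayIndexOutOfBoundsException and getTextFromPage, you will see the same problem reported, with no solution...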
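Since no fix is known here, one way to at least narrow the problem down is to catch the exception per page instead of around the whole loop, so the failing page numbers are recorded and the remaining pages are still extracted. The following is a rough sketch under stated assumptions: `PageTextSource` is a hypothetical stand-in interface for the extractor (in the question's code it would wrap `parse.getTextFromPage(i)`), and `extractAll` is an illustrative helper, not an iText API. It does not fix the decoding error itself.

```java
import java.util.ArrayList;
import java.util.List;

public class PerPageExtraction {
    /** Hypothetical stand-in for the extractor; not an iText type. */
    public interface PageTextSource {
        String getTextFromPage(int page) throws Exception;
    }

    /**
     * Extracts pages 1..pageCount, appending their text to out and
     * returning the numbers of the pages that threw an exception.
     */
    public static List<Integer> extractAll(PageTextSource source, int pageCount, StringBuilder out) {
        List<Integer> failedPages = new ArrayList<>();
        for (int i = 1; i <= pageCount; i++) {
            try {
                String text = source.getTextFromPage(i);
                if (text != null) {
                    out.append(text);
                }
            } catch (Exception e) {
                // Record the failing page and continue with the next one.
                failedPages.add(i);
            }
        }
        return failedPages;
    }

    public static void main(String[] args) {
        // Fake source that fails on page 2, imitating the exception in the question.
        PageTextSource fake = page -> {
            if (page == 2) {
                throw new ArrayIndexOutOfBoundsException("Invalid index: 62");
            }
            return "page" + page + ";";
        };
        StringBuilder out = new StringBuilder();
        List<Integer> failed = extractAll(fake, 3, out);
        System.out.println(out + " failed=" + failed);
    }
}
```

Knowing exactly which pages fail makes it easier to inspect the fonts used on those pages, which is where the stack trace points (`CMapAwareDocumentFont.decodeSingleCID`).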

By the way, your loop will stop before processing the last page, because the first page has index 1...
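A minimal sketch of the corrected loop bound, assuming only that pages are numbered from 1 up to and including the value returned by `getNumberOfPages()`, as the remark above implies. The class `PageRange` and the method `pageNumbers` are hypothetical helpers used for illustration; they are not part of iText.

```java
import java.util.ArrayList;
import java.util.List;

public class PageRange {
    /**
     * Returns the page numbers a 1-based loop should visit.
     * Hypothetical helper for illustration; not part of iText.
     */
    public static List<Integer> pageNumbers(int nrPages) {
        List<Integer> pages = new ArrayList<>();
        // Pages are 1-based, so the upper bound must be inclusive (<=).
        // The question's loop used i < nrPages and so skipped the last page.
        for (int i = 1; i <= nrPages; i++) {
            pages.add(i);
        }
        return pages;
    }

    public static void main(String[] args) {
        // In the question's code the loop body would call parse.getTextFromPage(i).
        System.out.println(pageNumbers(3)); // prints [1, 2, 3]
    }
}
```

In the original program the equivalent change is simply `for (int i = 1; i <= nrPages; i++)`.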

Regarding java PdfTextExtractor.getTextFromPage (Unknown Source), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/1761984/
