java - Extracting text from a PDF file using pdfbox

Reposted. Author: 搜寻专家. Updated: 2023-11-01 03:25:42
I am trying to extract text from a PDF file using pdfbox, not as a command-line tool but from within my Java application. I am using jsoup to download the PDF.

res = Jsoup
        .connect(host + action)
        .ignoreContentType(true)
        .data(data)
        .cookies(cookies)
        .method(Method.POST)
        .timeout(20 * 1000)
        .execute();

// prepare document
InputStream is = new ByteArrayInputStream(res.bodyAsBytes());
PDDocument pdf = new PDDocument();
pdf.load(is,true);

// extract text
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(pdf);

// print extracted text
System.out.println(text);

This code prints only empty lines. When I do this:

System.out.println(res.body());

it prints the contents of the PDF file, like this:

%PDF-1.4
%����
6 0 obj
<<
/Filter /FlateDecode
/Length 1869
>>
stream
x��X�n��

...

<<
/Size 28
/Info 27 0 R
/Root 26 0 R
>>
startxref
20632
%%EOF

So I am sure the PDF is downloaded correctly - it is just the PDF stripper that is not working...
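
Printing `res.body()` decodes binary data as text, which is why most of the dump is unreadable, so a check on the raw bytes is a more reliable way to confirm the download. The sketch below is an illustration and not part of the original post: the class `PdfDownloadCheck` and its method names are hypothetical, and it only looks for the `%PDF-` header and the `%%EOF` trailer marker visible in the dump above. It is a sanity check, not a validation of the PDF structure.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of jsoup or PDFBox): a cheap sanity check on downloaded bytes.
final class PdfDownloadCheck {

    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] TRAILER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    // Assumption: the trailer marker sits within the last 1024 bytes of the data.
    private static final int TRAILER_WINDOW = 1024;

    // Returns true if the data starts with "%PDF-" and contains "%%EOF" near the end.
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < HEADER.length + TRAILER.length) {
            return false;
        }
        if (!matchesAt(data, 0, HEADER)) {
            return false;
        }
        int start = Math.max(0, data.length - TRAILER_WINDOW);
        // Scan backwards from the end, since the marker is expected in the tail.
        for (int i = data.length - TRAILER.length; i >= start; i--) {
            if (matchesAt(data, i, TRAILER)) {
                return true;
            }
        }
        return false;
    }

    // Compares the bytes of pattern against data starting at offset.
    private static boolean matchesAt(byte[] data, int offset, byte[] pattern) {
        if (offset < 0 || offset + pattern.length > data.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (data[offset + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With the code from the question, this would be called as `PdfDownloadCheck.looksLikePdf(res.bodyAsBytes())`.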

--------------------------------------------EDIT

This problem is solved - working code is here http://thottingal.in/blog/2009/06/24/pdfbox-extract-text-from-pdf/

Best answer

(Question answered in the comments. See Question with no answers, but issue solved in the comments (or extended in chat).)

@WeloSefer wrote:

maybe this can help you get started ... I have never worked with jsoup nor pdfbox so I am no help but I sure will try pdfbox since I've been testing itextpdf reader for extracting texts.

OP wrote:

Thanks, that is what I was looking for - it works now :) this problem is solved - working code is here http://thottingal.in/blog/2009/06/24/pdfbox-extract-text-from-pdf/
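
The thread does not spell out what was wrong with the original code. One likely explanation (an assumption, not confirmed by the OP) is that `PDDocument.load` is a static factory method that returns the parsed document, so `pdf.load(is,true)` discards its result and `pdf` stays the empty document created by `new PDDocument()`. Below is a minimal sketch of the corrected flow, assuming the PDFBox 2.x API. In 1.x, `PDFTextStripper` lives in `org.apache.pdfbox.util`, and 3.x loads documents through `org.apache.pdfbox.Loader.loadPDF`. The class name `PdfTextExtractor` is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical wrapper class; assumes PDFBox 2.x is on the classpath.
final class PdfTextExtractor {

    // Parses the given PDF bytes and returns the text of all pages.
    static String extractText(byte[] pdfBytes) throws IOException {
        try (InputStream is = new ByteArrayInputStream(pdfBytes);
             // Keep the document returned by the static load method
             // instead of calling load on an empty instance.
             PDDocument pdf = PDDocument.load(is)) {
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(pdf);
        }
    }
}
```

With the code from the question, this would be called as `String text = PdfTextExtractor.extractText(res.bodyAsBytes());`. The try-with-resources block closes the document even when extraction throws.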

Regarding java - extracting text from a PDF file using pdfbox, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/14354427/
