
java - PDFBox for Persian documents

Reposted. Author: 行者123. Updated: 2023-11-30 10:12:04

I want to use PDFBox to extract text from a Persian PDF file, but it returns "?" for every Persian character (Latin words in the same document are returned correctly).

How can I fix this? Any suggestions?

Best answer

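Unfortunately, the file you provided contains the Persian text as vector graphics rather than as text drawn with a font, so it cannot be extracted. You will have to use OCR for it.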

See also the text extraction FAQ:

How come I am not getting any text from the PDF document?

Text extraction from a pdf document is a complicated task and there are many factors involved that affect the possibility and accuracy of text extraction. It would be helpful to the PDFBox team if you could try a couple things.

Open the PDF in Acrobat and try to extract text from there. If Acrobat can extract text then PDFBox should be able to as well and it is a bug if it cannot. If Acrobat cannot extract text then PDFBox ‘probably’ cannot either.

It might really be an image instead of text. Some PDF documents are just images that have been scanned in. You can tell by using the selection tool in Acrobat, if you can’t select any text then it is probably an image.
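A quick way to narrow down a symptom like the one in the question is to inspect the extracted string itself rather than what the console prints. The sketch below is only an illustration and is not part of PDFBox: the class `ExtractionCheck`, its methods, and the `Diagnosis` values are hypothetical names. It assumes you already hold the extracted text as a Java `String` and that Persian letters appear as code points in the Arabic Unicode blocks.

```java
import java.lang.Character.UnicodeBlock;

// Hypothetical helper (not a PDFBox class): classifies a string returned by text extraction.
final class ExtractionCheck {

    enum Diagnosis {
        NO_TEXT,                   // nothing extracted: likely an image or vector outlines, OCR needed
        PERSIAN_PRESENT,           // Arabic-script code points exist in the string
        QUESTION_MARKS_IN_STRING,  // literal '?' characters and no Arabic-script code points
        OTHER                      // e.g. Latin text only
    }

    // Counts code points in the Arabic blocks, which also cover the Persian letters.
    static int countArabicScript(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeBlock block = UnicodeBlock.of(cp);
            if (block == UnicodeBlock.ARABIC
                    || block == UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
                    || block == UnicodeBlock.ARABIC_PRESENTATION_FORMS_B) {
                count++;
            }
            i += Character.charCount(cp);
        }
        return count;
    }

    // Counts literal '?' characters (U+003F) in the string.
    static int countQuestionMarks(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '?') {
                count++;
            }
        }
        return count;
    }

    static Diagnosis diagnose(String text) {
        if (text.trim().isEmpty()) {
            return Diagnosis.NO_TEXT;
        }
        if (countArabicScript(text) > 0) {
            return Diagnosis.PERSIAN_PRESENT;
        }
        if (countQuestionMarks(text) > 0) {
            return Diagnosis.QUESTION_MARKS_IN_STRING;
        }
        return Diagnosis.OTHER;
    }

    public static void main(String[] args) {
        // Sample inputs written with Unicode escapes so the source file stays ASCII.
        System.out.println(diagnose("\u0633\u0644\u0627\u0645 PDFBox"));
        System.out.println(diagnose("???? PDFBox"));
        System.out.println(diagnose("   "));
    }
}
```

In practice the input would be the string returned by `PDFTextStripper.getText(document)`. A `PERSIAN_PRESENT` result would suggest that the extraction worked and the "?" only appear when the text is printed or written with a charset that cannot encode Persian. `NO_TEXT` matches the FAQ's image case, and `QUESTION_MARKS_IN_STRING` means the Persian text was never recovered as characters, which is consistent with the vector-graphics situation described in the answer.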

Regarding "java - PDFBox for Persian documents", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/52070656/
