
pdf - Getting the font of each line using PDFBox

Reposted. Author: 行者123. Updated: 2023-12-02 13:56:45

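Is there a way to get the font of each line of a PDF file using PDFBox? I have tried the following, but it only lists all the fonts used on the page. It does not show which line or which text is rendered in which font.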

// Lists every font declared in each page's resources (PDFBox 1.8.x style API).
List<PDPage> pages = doc.getDocumentCatalog().getAllPages();
for (PDPage page : pages)
{
    Map<String, PDFont> pageFonts = page.getResources().getFonts();
    for (String key : pageFonts.keySet())
    {
        System.out.println(key + " - " + pageFonts.get(key));
        System.out.println(pageFonts.get(key).getBaseFont());
    }
}

Any comments are welcome. Thanks!

Best answer

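Whenever you try to extract text (plain or with styling information) from a PDF using PDFBox, you should generally start with the PDFTextStripper class or one of its relatives. This class already does all the heavy lifting involved in parsing PDF content for you.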

You use the plain PDFTextStripper class like this:

PDDocument document = ...;
PDFTextStripper stripper = new PDFTextStripper();
// set stripper start and end page or bookmark attributes unless you want all the text
String text = stripper.getText(document);

This returns only the plain text, for example from some R40 form:

Claim for repayment of tax deducted 
from savings and investments
How to fill in this form
Please fill in this form with details of your income for the
above tax year. The enclosed Notes will help you (but there is
not a note for every box on the form). If you need more help
with anything on this form, please phone us on the number
shown above.
If you are not a UK resident, do not use this form – please
contact us.
Please do not send us any personal records, or tax
certificates or vouchers with your form. We will contact
you if we need these.
Please allow four weeks before contacting us about your
repayment. We will pay you as quickly as possible.
Use black ink and capital letters
Cross out any mistakes and write the
correct information below
...

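On the other hand, you can override its method writeString(String, List<TextPosition>) and process more information than the mere text. To add the name of the font used wherever the font changes, you can use this: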

PDFTextStripper stripper = new PDFTextStripper() {
    // Base font of the previously written glyph; a marker is emitted whenever it changes.
    String prevBaseFont = "";

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        StringBuilder builder = new StringBuilder();

        for (TextPosition position : textPositions)
        {
            String baseFont = position.getFont().getBaseFont();
            if (baseFont != null && !baseFont.equals(prevBaseFont))
            {
                // Font changed: prefix the following text with [fontName].
                builder.append('[').append(baseFont).append(']');
                prevBaseFont = baseFont;
            }
            builder.append(position.getCharacter());
        }

        writeString(builder.toString());
    }
};

For the same form you then get:

[DHSLTQ+IRModena-Bold]Claim for repayment of tax deducted 
from savings and investments
How to fill in this form
[OIALXD+IRModena-Regular]Please fill in this form with details of your income for the
above tax year. The enclosed Notes will help you (but there is
not a note for every box on the form). If you need more help
with anything on this form, please phone us on the number
shown above.
If you are not a UK resident, do not use this form – please
contact us.
[DHSLTQ+IRModena-Bold]Please do not send us any personal records, or tax
certificates or vouchers with your form. We will contact
you if we need these.
[OIALXD+IRModena-Regular]Please allow four weeks before contacting us about your
repayment. We will pay you as quickly as possible.
Use black ink and capital letters
Cross out any mistakes and write the
correct information below
...
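
The font names in this output, such as DHSLTQ+IRModena-Bold, carry a six-letter prefix followed by a plus sign. That prefix is the subset tag PDF producers add when only part of a font is embedded. If you would rather report the plain font name, something along these lines might do. It is only a sketch: SubsetTags and stripSubsetTag are hypothetical names for illustration, not part of PDFBox.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
final class SubsetTags {
    // A subset tag is exactly six uppercase letters followed by '+'.
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+");

    // Returns the font name without a leading subset tag; names without
    // a tag (and null) are returned unchanged.
    static String stripSubsetTag(String baseFont) {
        if (baseFont == null) {
            return null;
        }
        return SUBSET_TAG.matcher(baseFont).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(stripSubsetTag("DHSLTQ+IRModena-Bold")); // IRModena-Bold
        System.out.println(stripSubsetTag("Helvetica"));            // Helvetica
    }
}
```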

If you do not want the font information merged into the text, simply build separate structures in your overridden method.
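
As a rough sketch of such a separate structure, the following collector groups consecutive glyphs that share a font name into runs. It is plain Java with no PDFBox dependency; FontRunCollector, FontRun and add are made-up names, and the assumption is that you would call add once per TextPosition from inside the overridden writeString.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical collector, not part of PDFBox: keeps font information in its
// own structure instead of writing markers into the extracted text.
final class FontRunCollector {

    // One run of consecutive text rendered in a single font.
    static final class FontRun {
        final String fontName;
        final StringBuilder text = new StringBuilder();

        FontRun(String fontName) {
            this.fontName = fontName;
        }

        @Override
        public String toString() {
            return fontName + ": " + text;
        }
    }

    private final List<FontRun> runs = new ArrayList<>();

    // Appends the glyph to the current run, or starts a new run when the
    // font name differs from that of the previous glyph.
    void add(String fontName, String glyph) {
        FontRun last = runs.isEmpty() ? null : runs.get(runs.size() - 1);
        if (last == null || !Objects.equals(last.fontName, fontName)) {
            last = new FontRun(fontName);
            runs.add(last);
        }
        last.text.append(glyph);
    }

    List<FontRun> runs() {
        return runs;
    }

    public static void main(String[] args) {
        FontRunCollector collector = new FontRunCollector();
        for (String glyph : "Claim".split("")) {
            collector.add("IRModena-Bold", glyph);
        }
        for (String glyph : "Please".split("")) {
            collector.add("IRModena-Regular", glyph);
        }
        // Prints "IRModena-Bold: Claim" and then "IRModena-Regular: Please".
        collector.runs().forEach(System.out::println);
    }
}
```

Inside the overridden method you would then call something like collector.add(position.getFont().getBaseFont(), position.getCharacter()) for each position instead of appending markers to the StringBuilder.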

TextPosition provides more information about the text fragment it represents, such as its position and font size. Check it out!

Regarding "pdf - Getting the font of each line using PDFBox", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/21705961/
