Python PDFMiner : How to link outlines to underlying text

Reposted. Author: 行者123. Updated: 2023-12-04 11:15:34

I am trying to parse a PDF and build some kind of hierarchy from it. Consider this input:

Title 1
some text some text some text some text some text some text some text
some text some text some text some text some text some text some text

Title 1.1
some more text some more text some more text some more text
some more text some more text some more text some more text
some more text some more text

Title 2
some final text some final text
some final text some final text some final text some final text
some final text some final text some final text some final text

Here is how I extract the outlines/titles:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

path = 'myFile.pdf'
# Open a PDF file.
fp = open(path, 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization.
document = PDFDocument(parser, '')
# Each outline entry is a (level, title, dest, action, se) tuple.
outlines = document.get_outlines()
for (level, title, dest, a, se) in outlines:
    print(level, title)

This gives me:
(1, u'Title 1')
(2, u'Title 1.1')
(1, u'Title 2')

This is perfect, because the levels line up with the hierarchy of the text. Now I can extract the text as follows:
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfdocument import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
with open('textFromPdf.txt', 'w') as text_from_pdf:
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
        layout = device.get_result()
        for element in layout:
            if isinstance(element, LTTextBox):
                # Replace non-ASCII characters with spaces before writing.
                text_from_pdf.write(''.join([i if ord(i) < 128 else ' ' for i in element.get_text()]))

This gives me:
Title 1
some text some text some text some text some text some text some text
some text some text some text some text some text some text some text
Title 1.1
some more text some more text some more text some more text
some more text some more text some more text some more text
some more text some more text
Title 2
some final text some final text
some final text some final text some final text some final text
some final text some final text some final text some final text

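The order is fine, but now I have lost all sense of hierarchy. How do I know where one title ends and the next one begins? Also, when there are titles and subtitles, which one is the parent?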
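The parent question, at least, can be answered from the outline alone. The following is a minimal sketch that nests the (level, title) pairs printed earlier into a tree. The function name `build_outline_tree` and the dict layout are illustrative choices, not part of pdfminer, and the sketch assumes levels start at 1 as in the output above.

```python
def build_outline_tree(outlines):
    """Nest (level, title) pairs into a tree, using a stack of open ancestors."""
    # Synthetic root at level 0, so every real entry (level >= 1) has a parent.
    root = {"level": 0, "title": None, "children": []}
    stack = [root]
    for level, title in outlines:
        node = {"level": level, "title": title, "children": []}
        # Drop entries at the same or a deeper level; what remains on top
        # of the stack is the nearest shallower entry, i.e. the parent.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


tree = build_outline_tree([(1, "Title 1"), (2, "Title 1.1"), (1, "Title 2")])
for top in tree["children"]:
    print(top["title"], [child["title"] for child in top["children"]])
```

With the sample outline, `Title 1.1` ends up as a child of `Title 1`, while `Title 2` sits at the top level with no children.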

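Is there a way to connect the outline information to the layout elements? It would be great to be able to parse all of the information while iterating over the levels.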

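Another problem is that if there are any citations at the bottom of a page, the citation text gets mixed into the result. Is there a way to ignore headers, footers, and citations when parsing a PDF?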
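One common heuristic for headers and footers (not a pdfminer feature) is to drop any text box that falls inside a fixed band at the top or bottom of the page. The sketch below assumes pdfminer's bounding-box convention of (x0, y0, x1, y1) with the origin at the bottom-left corner of the page. The helper name `in_body` and the 50-point margins are hypothetical and would need tuning per document. Citations placed higher up the page would not be caught this way.

```python
def in_body(bbox, page_height, top_margin=50.0, bottom_margin=50.0):
    """Return True if the box lies entirely between the two margin bands.

    bbox is (x0, y0, x1, y1) in points, with y measured from the page bottom.
    The margin defaults are hypothetical values, not pdfminer settings.
    """
    _, y0, _, y1 = bbox
    return y0 >= bottom_margin and y1 <= page_height - top_margin


# Example on a US Letter page (792 points tall):
print(in_body((72, 400, 500, 420), 792))  # a box in the middle of the page
print(in_body((72, 20, 500, 40), 792))    # a box in the footer band
```

In the extraction loop, the check could be combined with the existing test, for example `isinstance(element, LTTextBox) and in_body(element.bbox, layout.height)`.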

Best answer

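I wish this were possible, but the pdfminer documentation explicitly states the following:
Some PDF documents use page numbers as destinations, while others use page numbers and the physical location within the page. Since PDF does not have a logical structure, and it does not provide a way to refer to any in-page object from the outside, there's no way to tell exactly which part of text these destinations are referring to.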
https://pdfminer-docs.readthedocs.io/programming.html#:~:text=Some%20PDF%20documents,are%20referring%20to .
Thanks
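Given that limitation, a rough workaround (not something the answer or the pdfminer documentation proposes) is to match outline titles against the extracted text itself. The sketch below assumes each title appears verbatim as the first line of some text block and that titles occur in outline order. `split_by_titles` is a hypothetical helper, and `blocks` stands in for the strings returned by `element.get_text()`. Any text before the first matched title is discarded.

```python
def split_by_titles(outlines, blocks):
    """Group text blocks under outline titles by exact first-line matching."""
    sections = []             # list of (level, title, [body text, ...])
    pending = list(outlines)  # titles not yet seen, in outline order
    for block in blocks:
        text = block.strip()
        first_line, _, rest = text.partition("\n")
        if pending and first_line.strip() == pending[0][1]:
            # This block starts the next section; keep any text after the title.
            level, title = pending.pop(0)
            body = [rest.strip()] if rest.strip() else []
            sections.append((level, title, body))
        elif sections:
            # Ordinary body text: attach it to the most recent section.
            sections[-1][2].append(text)
    return sections


outlines = [(1, "Title 1"), (2, "Title 1.1"), (1, "Title 2")]
blocks = ["Title 1\n", "some text\nsome text\n",
          "Title 1.1\nsome more text\n", "Title 2\n", "some final text\n"]
for level, title, body in split_by_titles(outlines, blocks):
    print(level, title, body)
```

This breaks down when a title is wrapped across lines, repeated in the body text, or altered during extraction, so the result should be checked against the outline.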

Regarding "Python PDFMiner : How to link outlines to underlying text", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46222559/
