Python PDFMiner: How to link outlines to underlying text

Reposted by: 行者123. Updated: 2023-12-04 11:15:34

I am trying to parse a PDF and build some kind of hierarchy from it. Consider the input

Title 1
some text some text some text some text some text some text some text 
some text some text some text some text some text some text some text 

Title 1.1
some more text some more text some more text some more text 
some more text some more text some more text some more text 
some more text some more text 

Title 2
some final text some final text 
some final text some final text some final text some final text 
some final text some final text some final text some final text

This is how I extract the outlines/titles:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

path = 'myFile.pdf'
# Open a PDF file.
fp = open(path, 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization.
document = PDFDocument(parser, '')
# Each outline entry is a (level, title, dest, action, structure element) tuple.
outlines = document.get_outlines()
for (level, title, dest, a, se) in outlines:
    print((level, title))

This gives me

(1, 'Title 1')
(2, 'Title 1.1')
(1, 'Title 2')

This is perfect, because the levels align with the text hierarchy. Now I can extract the text as follows:

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfdocument import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
with open('textFromPdf.txt', 'w') as text_from_pdf:
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
        layout = device.get_result()
        for element in layout:
            if isinstance(element, LTTextBox):
                # Replace non-ASCII characters with spaces before writing.
                text_from_pdf.write(''.join(i if ord(i) < 128 else ' ' for i in element.get_text()))

This gives me

Title 1
some text some text some text some text some text some text some text 
some text some text some text some text some text some text some text 
Title 1.1
some more text some more text some more text some more text 
some more text some more text some more text some more text 
some more text some more text 
Title 2
some final text some final text 
some final text some final text some final text some final text 
some final text some final text some final text some final text

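This is fine as far as the order goes, but now I have lost all sense of hierarchy. How do I know where one title ends and another begins? Also, if there is a title/heading, which one is its parent?

Is there a way to link the outline information to the layout elements? It would be great to be able to parse all the information while iterating through the levels.

Another problem is that if there are any citations at the bottom of a page, the citation text gets mixed in with the results. Is there a way to ignore headers, footers, and citations when parsing the PDF?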
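
One possible heuristic for the header/footer part, which is not from the original post: pdfminer layout elements carry a bounding box (`element.bbox`, as `(x0, y0, x1, y1)` in PDF points with the origin at the bottom-left of the page), so text boxes that fall inside a top or bottom margin band can be skipped. The helper name `is_in_body` and the 50-point margins below are assumptions for illustration and would need tuning per document. Footnotes or citations of varying height will not be caught reliably this way.

```python
def is_in_body(bbox, page_height, top_margin=50.0, bottom_margin=50.0):
    """Return True if a box lies entirely between the top and bottom margins.

    bbox: (x0, y0, x1, y1) in PDF points, origin at the bottom-left corner.
    page_height: height of the page in the same units.
    The margin sizes are assumed values, not something pdfminer provides.
    """
    _x0, y0, _x1, y1 = bbox
    return y0 >= bottom_margin and y1 <= page_height - top_margin


# Hypothetical use inside the extraction loop (layout is the LTPage result):
#     if isinstance(element, LTTextBox) and is_in_body(element.bbox, layout.height):
#         text_from_pdf.write(element.get_text())

# A box in the middle of a US Letter page (792 points tall) is kept,
# while boxes near the bottom or top edge are dropped.
print(is_in_body((72, 100, 500, 120), 792))
print(is_in_body((72, 20, 500, 40), 792))
print(is_in_body((72, 760, 500, 780), 792))
```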

Best answer

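I wish this were possible, but the pdfminer documentation explicitly states the following:
Some PDF documents use page numbers as destinations, while others use page numbers and the physical location within the page. Since PDF does not have a logical structure, and it does not provide a way to refer to any in-page object from the outside, there's no way to tell exactly which part of text these destinations are referring to.
https://pdfminer-docs.readthedocs.io/programming.html#:~:text=Some%20PDF%20documents,are%20referring%20to
Thanks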
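
Although the outline destinations cannot be resolved to exact text, an approximate link can sometimes be recovered by plain string matching. The sketch below is not a pdfminer feature. It assumes that every outline title appears verbatim as its own line in the extracted text, in the same order as in the outline. The function name `group_text_by_outline` and the dictionary layout are hypothetical. It takes the `(level, title)` pairs and the extracted lines, and groups the lines under the title they follow, recording each section's parent as an index into the result list.

```python
def group_text_by_outline(outlines, lines):
    """Group extracted lines under the outline title they follow.

    outlines: list of (level, title) pairs in document order.
    lines: extracted text lines in reading order.
    Assumption: each title occurs verbatim as a separate line, in outline order.
    """
    sections = []
    latest = {}      # level -> index of the most recent section at that level
    next_title = 0   # position of the next outline entry we expect to meet
    for line in lines:
        text = line.strip()
        if next_title < len(outlines) and text == outlines[next_title][1].strip():
            level, title = outlines[next_title]
            # The parent is the most recent section at the nearest shallower level.
            shallower = [k for k in latest if k < level]
            parent = latest[max(shallower)] if shallower else None
            sections.append({"level": level, "title": title,
                             "parent": parent, "text": []})
            # Entering this level invalidates anything deeper seen before.
            latest = {k: v for k, v in latest.items() if k < level}
            latest[level] = len(sections) - 1
            next_title += 1
        elif sections and text:
            sections[-1]["text"].append(text)
    return sections


outlines = [(1, "Title 1"), (2, "Title 1.1"), (1, "Title 2")]
lines = [
    "Title 1",
    "some text some text",
    "Title 1.1",
    "some more text some more text",
    "Title 2",
    "some final text some final text",
]
for section in group_text_by_outline(outlines, lines):
    print(section["level"], section["title"], section["parent"], section["text"])
```

The match is exact, so it fails when a title is wrapped across lines, altered by the extraction, or repeated in the body text. Fuzzy matching or the destination page number could narrow the search, but neither removes the ambiguity described in the documentation.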

Regarding "Python PDFMiner : How to link outlines to underlying text", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46222559/
