Python PDF mining: getting the position of every line of text

Reposted. Author: 行者123. Updated: 2023-12-02 00:58:18

I am currently using the class provided in the answer here:

How to extract text and text coordinates from a pdf file?

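The provided class is very useful because it gives me the position of every text box in the PDF. It also inserts a "_" wherever there is a line break inside a text box.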

Is there a way to also get the position of each line of text within a text box?

Best answer

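Found it: the solution is to keep recursing, even into a TextBox, until a text line is reached. When its parsepdf method is called, the class below should print the x and y coordinates of every line of text in the PDF.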

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer.layout


class pdfPositionHandling:

    def parse_obj(self, lt_objs):
        # loop over the object list
        for obj in lt_objs:

            # a text line: print its x0, y0 and text, with newlines shown as '_'
            if isinstance(obj, pdfminer.layout.LTTextLine):
                print("%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.get_text().replace('\n', '_')))

            # if it's a textbox, also recurse
            if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
                self.parse_obj(obj._objs)

            # if it's a container, recurse
            elif isinstance(obj, pdfminer.layout.LTFigure):
                self.parse_obj(obj._objs)

    def parsepdf(self, filename, startpage, endpage):
        # Open a PDF file; the with block closes it once all pages are processed.
        with open(filename, 'rb') as fp:

            # Create a PDF parser object associated with the file object.
            parser = PDFParser(fp)

            # Create a PDF document object that stores the document structure.
            # Password for initialization as 2nd parameter
            document = PDFDocument(parser)

            # Check if the document allows text extraction. If not, abort.
            if not document.is_extractable:
                raise PDFTextExtractionNotAllowed

            # Create a PDF resource manager object that stores shared resources.
            rsrcmgr = PDFResourceManager()

            # BEGIN LAYOUT ANALYSIS
            # Set parameters for analysis.
            laparams = LAParams()

            # Create a PDF page aggregator object.
            device = PDFPageAggregator(rsrcmgr, laparams=laparams)

            # Create a PDF interpreter object.
            interpreter = PDFPageInterpreter(rsrcmgr, device)

            i = 0
            # loop over all pages in the document (page indices are zero-based)
            for page in PDFPage.create_pages(document):
                if i >= startpage and i <= endpage:
                    # read the page into a layout object
                    interpreter.process_page(page)
                    layout = device.get_result()

                    # extract text from this object
                    self.parse_obj(layout._objs)
                i += 1
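
If the coordinates are needed as data rather than printed output, the same recursion can be written as a generator. The following is only a minimal sketch, not part of the original answer: the function name iter_text_lines and its parameters are hypothetical, the classes FakeLine and FakeBox are stand-ins invented so the example runs without pdfminer or a PDF file, and it assumes that container objects can be iterated over to reach their children.

```python
def iter_text_lines(lt_objs, line_type, container_types):
    """Yield (x0, y0, text) for each line_type object, recursing into containers."""
    for obj in lt_objs:
        if isinstance(obj, line_type):
            # bbox is (x0, y0, x1, y1); keep the first two values, as the class above does
            yield obj.bbox[0], obj.bbox[1], obj.get_text().rstrip("\n")
        elif isinstance(obj, container_types):
            # assumption: containers are iterable over their child objects
            yield from iter_text_lines(obj, line_type, container_types)


# Hypothetical stand-ins so the sketch runs without pdfminer or a PDF file.
class FakeLine:
    def __init__(self, bbox, text):
        self.bbox = bbox
        self._text = text

    def get_text(self):
        return self._text


class FakeBox:
    def __init__(self, children):
        self._children = children

    def __iter__(self):
        return iter(self._children)


page = FakeBox([
    FakeLine((72, 700, 300, 712), "first line\n"),
    FakeBox([FakeLine((72, 686, 300, 698), "second line\n")]),
])
for x, y, text in iter_text_lines(page, FakeLine, FakeBox):
    print(x, y, text)
```

With pdfminer, one would presumably pass the layout returned by device.get_result() together with pdfminer.layout.LTTextLine as line_type and (LTTextBoxHorizontal, LTFigure) as container_types. Note that bbox[0] and bbox[1] are the lower-left corner of the line, measured in PDF coordinates whose origin is at the bottom-left of the page.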

For more on getting the position of every line of text with Python PDF mining, see the similar question on Stack Overflow: https://stackoverflow.com/questions/31819862/
