
python - Scraping PDF text with Python (pdfquery)

Reposted. Author: 太空宇宙. Updated: 2023-11-04 04:30:47

I need to scrape some PDF files to extract the following text information:

[image]

I tried to do this with pdfquery by adapting an example I found on Reddit (see the first post): https://www.reddit.com/r/Python/comments/4bnjha/scraping_pdf_files_with_python/

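I wanted to test it by extracting the license numbers. I opened the generated "xmltree" file, found the first license number, and took the x0, y0, x1, y1 coordinates from its LTTextLineHorizontal element.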

import pdfquery
from lxml import etree


PDF_FILE = 'C:\\TEMP\\ad-4070-20-september-2018.pdf'

pdf = pdfquery.PDFQuery(PDF_FILE)
pdf.load(4,5)

with open('xmltree.xml','wb') as f:
    f.write(etree.tostring(pdf.tree, pretty_print=True))

product_info = []
page_count = len(pdf._pages)
for pg in range(page_count):
    data = pdf.extract([
        ('with_parent', 'LTPage[pageid="{}"]'.format(pg+1)),
        ('with_formatter', None),
        ('product_name', 'LTTextLineHorizontal:in_bbox("89.904, 757.502, 265.7, 770.83")'),
        ('product_details', 'LTTextLineHorizontal:in_bbox("223, 100, 737, 1114")'),
    ])
    for ix, pn in enumerate(sorted([d for d in data['product_name'] if d.text.strip()], key=lambda x: x.get('y0'), reverse=True)):
        product_info.append({'Manufacturer': pn.text.strip(), 'page': pg, 'y_start': float(pn.get('y1')), 'y_end': float(pn.get('y1'))-150})
        # if this is not the first product on the page, update the previous product's y_end with a
        # value slightly greater than this product's y coordinate start
        if ix > 0:
            product_info[-2]['y_end'] = float(pn.get('y0'))
    # for every product found on this page, find the detail information that falls between the
    # y coordinates belonging to the product
    for product in [p for p in product_info if p['page'] == pg]:
        details = []
        for d in sorted([d for d in data['product_details'] if d.text.strip()], key=lambda x: x.get('y0'), reverse=True):
            if product['y_start'] > float(d.get('y0')) > product['y_end']:
                details.append(d.text.strip())
        product['Details'] = ' '.join(details)
pdf.file.close()

for p in product_info:
    print('Manufacturer: {}\r\nDetail Info:{}...\r\n\r\n'.format(p['Manufacturer'], p['Details'][0:100]))

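However, when I run it, nothing is printed. There are no errors, the XML file is generated correctly, and I took the coordinates directly from the XML file, so they should be right. What am I doing wrong?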
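One way to sanity-check coordinates copied from the dumped tree is to parse the XML directly and list which text lines fall entirely inside a given box, together with the pageid of the page they sit under. The sketch below uses only the standard library. The sample XML is invented to imitate the shape described in the question (LTPage elements containing LTTextLineHorizontal elements with x0, y0, x1, y1 attributes), and the helper name lines_in_bbox is hypothetical. It is a diagnostic aid under those assumptions, not a confirmed explanation of the empty output.

```python
import xml.etree.ElementTree as ET

# Invented sample imitating the shape of a dumped tree; NOT taken from the real PDF.
SAMPLE_XML = """<pdfxml>
  <LTPage pageid="5">
    <LTTextLineHorizontal x0="89.904" y0="757.502" x1="265.7" y1="770.83">ABC-123 </LTTextLineHorizontal>
    <LTTextLineHorizontal x0="300.0" y0="700.0" x1="500.0" y1="712.0">other text </LTTextLineHorizontal>
  </LTPage>
</pdfxml>"""


def lines_in_bbox(xml_text, bbox):
    """Return (pageid, text) for each text line lying entirely inside bbox.

    bbox is (x0, y0, x1, y1) in the same units as the XML attributes.
    """
    bx0, by0, bx1, by1 = bbox
    hits = []
    root = ET.fromstring(xml_text)
    for page in root.iter("LTPage"):
        for line in page.iter("LTTextLineHorizontal"):
            x0, y0, x1, y1 = (float(line.get(k)) for k in ("x0", "y0", "x1", "y1"))
            # Keep the line only if its box fits completely inside the query box.
            if x0 >= bx0 and y0 >= by0 and x1 <= bx1 and y1 <= by1:
                hits.append((page.get("pageid"), "".join(line.itertext()).strip()))
    return hits


if __name__ == "__main__":
    for pageid, text in lines_in_bbox(SAMPLE_XML, (89.904, 757.502, 265.7, 770.83)):
        print(pageid, text)
```

To try this on the real dump, read the contents of xmltree.xml and pass them in place of SAMPLE_XML. The pageid values it reports can then be compared with the ones the 'with_parent' selector in the script asks for.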

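Best answer

My favorite tool for extracting text from PDF files is pdftotext.

With the -layout option you basically get plain text back, which is relatively easy to manipulate with Python.

Here is an example: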

"""Extract text from PDF files.

Requires pdftotext from the poppler utilities.
On unix/linux install them using your favorite package manager.

Binaries for ms-windows can be found at:
1) VERY OLD 32 bit http://blog.alivate.com.au/poppler-windows/
   RECENT 64 bit https://github.com/oschwartz10612/poppler-windows
2) https://sourceforge.net/projects/poppler-win32/
"""

import subprocess


def pdftotext(pdf, page=None):
    """Retrieve all text from a PDF file.

    Arguments:
        pdf: Path of the file to read.
        page: Number of the page to read. If None, read all the pages.

    Returns:
        A list of lines of text.
    """
    if page is None:
        args = ['pdftotext', '-layout', '-q', pdf, '-']
    else:
        args = ['pdftotext', '-f', str(page), '-l', str(page), '-layout',
                '-q', pdf, '-']
    try:
        txt = subprocess.check_output(args, universal_newlines=True)
        lines = txt.splitlines()
    except subprocess.CalledProcessError:
        lines = []
    return lines
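
As a rough illustration of why the layout-preserving text is easy to work with, the lines returned by the function above could be split into columns. This is only a sketch: it assumes columns are separated by runs of two or more spaces, which may not hold for every PDF, and both the helper name split_columns and the sample lines are made up for the example.

```python
import re


def split_columns(line):
    """Split one line of layout-preserving text on runs of 2+ whitespace characters."""
    return [col for col in re.split(r"\s{2,}", line.strip()) if col]


# Invented sample lines; real output depends entirely on the PDF.
SAMPLE_LINES = [
    "Licence No.      Holder             Expiry",
    "12345            Example Pty Ltd    20 September 2018",
    "",
]

# Skip blank lines, then break each remaining line into its columns.
ROWS = [split_columns(line) for line in SAMPLE_LINES if line.strip()]

if __name__ == "__main__":
    for row in ROWS:
        print(row)
```

In practice the list passed in would be the result of a call such as pdftotext(path, page=5) rather than SAMPLE_LINES. Single spaces inside a cell (as in a date or company name) are left intact because only wider gaps are treated as column breaks.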

Regarding python - Scraping PDF text with Python (pdfquery), a similar question was found on Stack Overflow: https://stackoverflow.com/questions/52683133/
