python - Opening a PDF from a URL with pdfminer.six

Reposted. Author: 行者123. Updated: 2023-12-05 02:53:36

Background: Python 3.7 and pdfminer.six

Using the information found in "Exporting Data from PDFs with Python", I have the following code:

    import io

    from pdfminer.converter import TextConverter
    from pdfminer.pdfinterp import PDFPageInterpreter
    from pdfminer.pdfinterp import PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    def extract_text_from_pdf(pdf_path):
        resource_manager = PDFResourceManager()
        fake_file_handle = io.StringIO()
        converter = TextConverter(resource_manager, fake_file_handle)
        page_interpreter = PDFPageInterpreter(resource_manager, converter)

        with open(pdf_path, 'rb') as fh:
            for page in PDFPage.get_pages(fh,
                                          caching=True,
                                          check_extractable=True):
                page_interpreter.process_page(page)

            text = fake_file_handle.getvalue()

        # close open handles
        converter.close()
        fake_file_handle.close()

        if text:
            return text

    if __name__ == '__main__':
        path = '../_pdfs/mypdf.pdf'
        print(extract_text_from_pdf(path))

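This works (yay!), but what I really want is to request the PDF directly from its URL instead of opening a PDF that was previously saved to the local drive.

I don't know how to modify the "with open" logic so that it reads from a remote URL, and I'm also not sure which request library is the best choice for recent versions of Python (requests, urllib, urllib2, etc.?).

I'm new to Python, so please bear that in mind. (P.S. I found other questions about this, but I couldn't get any of them to work, possibly because they tend to be quite old.)

Any help would be greatly appreciated! Thanks!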

Best answer

This is how I solved it:

    from io import StringIO, BytesIO
    import urllib.request

    from pdfminer.converter import TextConverter
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    def extract_text_from_pdf_url(url, user_agent=None):
        resource_manager = PDFResourceManager()
        fake_file_handle = StringIO()
        converter = TextConverter(resource_manager, fake_file_handle)

        # Fall back to a browser-like User-Agent, since some servers reject
        # requests that carry urllib's default one.
        if user_agent is None:
            user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'

        headers = {'User-Agent': user_agent}
        request = urllib.request.Request(url, data=None, headers=headers)

        # Download the whole PDF into memory and wrap the bytes in a
        # file-like object, which takes the place of open(pdf_path, 'rb').
        response = urllib.request.urlopen(request).read()
        fb = BytesIO(response)

        page_interpreter = PDFPageInterpreter(resource_manager, converter)

        for page in PDFPage.get_pages(fb,
                                      caching=True,
                                      check_extractable=True):
            page_interpreter.process_page(page)

        text = fake_file_handle.getvalue()

        # close open handles
        fb.close()
        converter.close()
        fake_file_handle.close()

        if text:
            # If document has instances of \xa0 replace them with spaces.
            # NOTE: \xa0 is non-breaking space in Latin1 (ISO 8859-1) & chr(160)
            text = text.replace('\xa0', ' ')

        return text
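
The same idea can probably be written more compactly with pdfminer.six's high-level helper. The sketch below is not part of the original answer: it assumes the installed pdfminer.six release provides pdfminer.high_level.extract_text and that it accepts a binary file-like object. The names DEFAULT_USER_AGENT, build_request, looks_like_pdf, fetch_pdf_stream and extract_text_from_pdf_url_simple are hypothetical, as are the 30-second timeout and the header check, which is only a heuristic for catching HTML error pages.

```python
import io
import urllib.request

# Hypothetical default; any User-Agent string the server accepts will do.
DEFAULT_USER_AGENT = 'Mozilla/5.0 (compatible; pdf-text-fetcher)'


def build_request(url, user_agent=None):
    """Create a urllib request that carries a User-Agent header."""
    headers = {'User-Agent': user_agent or DEFAULT_USER_AGENT}
    return urllib.request.Request(url, headers=headers)


def looks_like_pdf(data):
    """Heuristic: PDF files carry a '%PDF-' marker near the start."""
    return b'%PDF-' in data[:1024]


def fetch_pdf_stream(url, user_agent=None, timeout=30):
    """Download the URL and return its bytes wrapped in a BytesIO."""
    request = build_request(url, user_agent)
    # The with-block closes the connection once the body has been read.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        data = response.read()
    if not looks_like_pdf(data):
        raise ValueError('Response from %s does not look like a PDF' % url)
    return io.BytesIO(data)


def extract_text_from_pdf_url_simple(url, user_agent=None):
    """Fetch a PDF over HTTP(S) and return its text."""
    # Imported here so the download helpers work without pdfminer.six.
    # Assumption: this pdfminer.six release ships high_level.extract_text.
    from pdfminer.high_level import extract_text

    with fetch_pdf_stream(url, user_agent) as stream:
        text = extract_text(stream)
    # Replace non-breaking spaces, as the answer above does.
    return text.replace('\xa0', ' ')
```

Calling extract_text_from_pdf_url_simple('https://example.com/report.pdf') (a placeholder URL) should then return the document text as a single string. Keeping the download separate from the parsing also makes it easy to swap urllib for another HTTP client later.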

Regarding "python - Opening a PDF from a URL with pdfminer.six", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/62157733/
