
python - How to fix 'UnicodeDecodeError' when trying to extract text with pdfminer.six?

Reposted. Author: 行者123. Updated: 2023-11-28 21:41:05

When I use pdfminer installed with pip install git+https://github.com/pdfminer/pdfminer.six.git (the latest version from git), I get a UnicodeDecodeError:

Traceback (most recent call last):
  File "pdfminer_sample3.py", line 34, in <module>
    print(convert_pdf_to_txt("samples/numbers-test-document.pdf"))
  File "pdfminer_sample3.py", line 27, in convert_pdf_to_txt
    text = retstr.getvalue()
  File "/usr/lib/python2.7/StringIO.py", line 271, in getvalue
    self.buf += ''.join(self.buflist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

How can I fix this?

Script

#!/usr/bin/env python

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from StringIO import StringIO
import codecs

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()

    for page in PDFPage.get_pages(fp, pagenos,
                                  maxpages=maxpages,
                                  password=password,
                                  caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

print(convert_pdf_to_txt("samples/numbers-test-document.pdf"))

Sample PDF

https://www.dropbox.com/s/khjfr63o82fa5yn/numbers-test-document.pdf?dl=0

Best answer

Replace from StringIO import StringIO with from io import BytesIO.

Replace retstr = StringIO() with retstr = BytesIO().
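
Why this likely helps: the Python 2 StringIO.StringIO class accepts both Unicode and 8-bit strings, but if the two are mixed and an 8-bit string contains non-ASCII bytes, getvalue() raises a UnicodeError when it joins them. The traceback above fails at exactly that join. A byte buffer sidesteps the implicit ASCII conversion, and the result can be decoded once, explicitly. The sketch below illustrates the idea with the standard library only. write_encoded_chunks is a hypothetical stand-in for a converter configured with codec='utf-8', not part of pdfminer, and the sample text is made up.

```python
from io import BytesIO


def write_encoded_chunks(out, chunks, codec="utf-8"):
    """Hypothetical stand-in for a converter that writes encoded bytes to `out`."""
    for chunk in chunks:
        out.write(chunk.encode(codec))


# Made-up text containing an EN DASH, which UTF-8 encodes as 0xe2 0x80 0x93.
chunks = [u"pages 1", u"\u2013", u"3\n"]

retstr = BytesIO()                  # byte buffer, as suggested in the answer
write_encoded_chunks(retstr, chunks)
raw = retstr.getvalue()             # plain bytes; no implicit decoding happens here
retstr.close()

# Decode once, explicitly, with the same codec that was used for encoding.
text = raw.decode("utf-8")

# Decoding the same bytes as ASCII fails on the first byte >= 0x80,
# which mirrors the kind of failure shown in the traceback.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc
```

With this change, convert_pdf_to_txt would return bytes from retstr.getvalue(), so a caller who wants text should decode the result with the codec passed to TextConverter (here 'utf-8').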

Source: a similar question on Stack Overflow, "python - How to fix 'UnicodeDecodeError' when trying to extract text with pdfminer.six?": https://stackoverflow.com/questions/45101658/
