
python - PDFMiner - Iterating over pages and converting them to text

Reposted. Author: 太空宇宙. Updated: 2023-11-03 11:53:04

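I'm trying to get specific bits of text out of some PDFs, and I'm using Python with PDFMiner, but I'm having trouble because of the API changes made to it in November 2013. Basically, to get the part of the text I want out of a PDF, I currently have to convert the entire file to text and then use string functions to pull out the part I want. What I want to do instead is loop over each page of the PDF and convert the pages to text one at a time. Then, once I've found the part I want, I'd stop reading that PDF.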

I'll post the code that's sitting in my text editor atm, but it's not a working version, it's more of a halfway-to-an-efficient-solution version :P

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.converter import LTChar, TextConverter
from pdfminer.layout import LAParams
from subprocess import call
from cStringIO import StringIO
import re
import sys
import os

argNum = len(sys.argv)
pdfLoc = str(sys.argv[1]) #CLI arguments

def convert_pdf_to_txt(path): #converts pdf to raw text (not my function)
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    fp.close()
    device.close()
    str = retstr.getvalue()
    retstr.close()
    return str

if (pdfLoc[-4:] == ".pdf"):
    contents = ""
    try: # Get the outlines (contents) of the document
        fp = open(pdfLoc, 'rb') #open a pdf document for reading
        parser = PDFParser(fp)
        document = PDFDocument(parser)
        outlines = document.get_outlines()
        for (level,title,dest,a,se) in outlines:
            title = re.sub(r".*\s", "", title) #get raw titles, stripped of formatting
            contents += title + "\n"
    except: #if pdfMiner can't get contents then manually get contents from text conversion
        #contents = convert_pdf_to_txt(pdfLoc)
        #startToCpos = contents.find("TABLE OF CONTENTS")
        #endToCpos = contents.rfind(". . .")
        #contents = contents[startToCpos:endToCpos+8]

        fp = open(pdfLoc, 'rb') #open a pdf document for reading
        parser = PDFParser(fp)
        document = PDFDocument(parser)
        pages = PDFPage(document, 3, {'Resources':'thing', 'MediaBox':'Thing'}) #God knows what's going on here
        for pageNumber, page in enumerate(pages.get_pages(PDFDocument, fp)): #The hell is the first argument?
            if pageNumber == 42:
                print "Hello"

        #for line in s:
        #    print line
        #    if (re.search("(\.\s){2,}", line) and not re.search("NOTES|SCOPE", line)):
        #        line = re.sub("(\.\s){2,}", "", line)
        #        line = re.sub("(\s?)*[0-9]*\n", "\n", line)
        #        line = re.sub("^\s", "", line)
        #        print line,


    #contents = contents.lower()
    #contents = re.sub("“", "\"", contents)
    #contents = re.sub("”", "\"", contents)
    #contents = re.sub("fi", "f", contents)
    #contents = re.sub(r"(TABLE OF CONTENTS|LIST OF TABLES|SCOPE|REFERENCED DOCUMENTS|Identification|System (o|O)verview|Document (o|O)verview|Title|Page|Table|Tab)(\n)?|\.\s?|Section|[0-9]", "", contents)
    #contents = re.sub(r"This document contains proprietary information and may not be reproduced in any form whatsoever, nor may be used by or its contents divulged to third\nparties without written permission from the ownerAll rights reservedNumber: STP SMEDate: -Jul-Issue: A of CMC STPNHIndustriesCLASSIFICATION\nNATO UNCLASSIFIED AGUSTAEUROCOPTEREUROCOPTER DEUTSCHLAND FOKKER", "", contents)
    #contents = re.sub(r"(\r?\n){2,}", "", contents)
    #contents = contents.lstrip()
    #contents = contents.rstrip()
    #print contents
else:
    print "Not a valid PDF file"

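This is the old way of doing it (or at least it gives an idea of how the old way worked; that thread wasn't very useful to me tbh). But now I have to use PDFPage.get_pages rather than PDFDocument.get_pages, and the methods and their parameters are quite different.

Currently, I'm trying to figure out what on earth the "Klass" variable is that I pass to PDFPage's get_pages method.

If anyone could shed some light on this part of the API, or even provide a working example, I'd be very grateful.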
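
A note on the "Klass" question, with a sketch of the page-by-page loop described above. As far as I can tell, get_pages is declared as a classmethod, so its first parameter is the PDFPage class itself and Python fills it in when you write PDFPage.get_pages(fp); you never construct a PDFPage or pass that argument yourself. The sketch below assumes Python 3 and the pdfminer.six fork (not the 2013 Python 2 release used in the question). The names iter_page_texts and find_first_page are hypothetical helpers made up for illustration, not part of PDFMiner.

```python
from io import StringIO


def iter_page_texts(path):
    """Yield the text of each page of the PDF at `path`, one page at a time."""
    # Assumption: pdfminer.six is installed. Imported lazily so that
    # find_first_page below can be used and tested without it.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    with open(path, "rb") as fp:
        # get_pages is called on the class; no "klass" argument is passed by hand.
        for page in PDFPage.get_pages(fp):
            buf = StringIO()  # fresh buffer so each yield holds one page only
            device = TextConverter(rsrcmgr, buf, laparams=LAParams())
            PDFPageInterpreter(rsrcmgr, device).process_page(page)
            device.close()
            yield buf.getvalue()


def find_first_page(page_texts, needle):
    """Return (zero-based index, text) of the first page containing `needle`, else None."""
    for index, text in enumerate(page_texts):
        if needle in text:
            # Returning here stops the loop, so later pages are never converted.
            return index, text
    return None
```

Usage would look something like find_first_page(iter_page_texts("some.pdf"), "TABLE OF CONTENTS"), where the file name is a placeholder. Because iter_page_texts is a generator, pages after the first match are never parsed.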

Best answer

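Try using PyPDF2. It's much simpler to use and isn't needlessly feature-rich like PDFMiner (which is a good thing in your case). Here's what you wanted, and it's super simple to implement.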

from PyPDF2 import PdfFileReader

PDF = PdfFileReader(file(pdf_fp, 'rb'))

if PDF.isEncrypted:
    decrypt = PDF.decrypt('')
    if decrypt == 0:
        print "Password Protected PDF: " + pdf_fp
        raise Exception("Nope")
    elif decrypt == 1 or decrypt == 2:
        print "Successfully Decrypted PDF"

for page in PDF.pages:
    print page.extractText()
    '''page.extractText() is the unicode string of the contents of the page
    And I am assuming you know how to play with a string and use regex
    If you find what you want just break like so:
    if some_condition == True:
        break'''
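
The snippet above is Python 2 and uses the older PyPDF2 names. As far as I know, recent PyPDF2 releases (3.0 and later) renamed PdfFileReader to PdfReader, isEncrypted to is_encrypted, and extractText() to extract_text(). Here is a rough Python 3 sketch of the same idea under that assumption; open_reader and search_pages are hypothetical helper names, not part of PyPDF2.

```python
def open_reader(path, password=""):
    """Open a PDF and try to decrypt it; raise if the password is rejected."""
    # Assumption: PyPDF2 >= 3.0 is installed. Imported lazily so that
    # search_pages below can be used and tested without it.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    if reader.is_encrypted:
        # decrypt() returns a falsy value (0) when the password does not work.
        if not reader.decrypt(password):
            raise ValueError("Password-protected PDF: " + path)
    return reader


def search_pages(pages, needle):
    """Return (1-based page number, text) of the first page containing `needle`, else None."""
    for number, page in enumerate(pages, start=1):
        text = page.extract_text() or ""  # treat a missing result as empty text
        if needle in text:
            return number, text  # stop at the first hit, like the break above
    return None
```

It would be called as search_pages(open_reader("some.pdf").pages, "SCOPE"), with the file name and search string being placeholders.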

Regarding "python - PDFMiner - Iterating over pages and converting them to text", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21113773/
