python - Trying to loop through multiple PDF files and extract text between two search criteria

Reposted by: 太空宇宙. Last updated: 2023-11-04 04:36:19

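I am trying to go through multiple PDF files, read the text of each one, and extract the paragraphs between "Note 1- Organization" (start) and "Note 2- Organization" (end). Each file has different text in this section, and I want to either print the paragraphs from every file or save them to a text file.

Below is a small script I put together. It opens one file, searches for a text string, and prints the pages on which that text is found. I think it is a good start, but what I really want is to loop over many PDF files, find a specific body of text, and save everything found to a single text file.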

import PyPDF2
import re

# open the pdf file
object = PyPDF2.PdfFileReader("C:/my_path/file1.pdf")

# get number of pages
NumPages = object.getNumPages()

# define keyterms
String = "New York State Real Property Law"

# extract text and do the search
for i in range(0, NumPages):
    PageObj = object.getPage(i)
    print("this is page " + str(i)) 
    Text = PageObj.extractText() 
    # print(Text)
    ResSearch = re.search(String, Text)
    print(ResSearch)

Any insight into solving this would be greatly appreciated!

Best answer

If your file names look like file1.pdf, file2.pdf, and so on, you can use a for loop:

import PyPDF2
import re

for k in range(1,100):
    # open the pdf file
    object = PyPDF2.PdfFileReader("C:/my_path/file%s.pdf"%(k))

    # get number of pages
    NumPages = object.getNumPages()

    # define keyterms
    String = "New York State Real Property Law"

    # extract text and do the search
    for i in range(0, NumPages):
        PageObj = object.getPage(i)
        print("this is page " + str(i)) 
        Text = PageObj.extractText() 
        # print(Text)
        ResSearch = re.search(String, Text)
        print(ResSearch)
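
One caveat with the numbered loop: if a number in the range has no matching file, opening it fails and the loop stops. A minimal sketch, assuming the same file%s.pdf naming, that keeps only the paths that actually exist. The helper name existing_numbered_pdfs and its parameters are made up for illustration and are not part of PyPDF2:

```python
import os


def existing_numbered_pdfs(folder, first=1, last=99, pattern="file%s.pdf"):
    """Return the numbered PDF paths in folder that exist on disk."""
    paths = []
    for k in range(first, last + 1):
        path = os.path.join(folder, pattern % k)
        if os.path.isfile(path):  # skip gaps in the numbering instead of failing
            paths.append(path)
    return paths
```

The returned list could then be looped over in place of range(1,100), opening each path as in the code above.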

Otherwise, you can use the os module to walk through your folder:

import PyPDF2
import re
import os

for foldername,subfolders,files in os.walk(r"C:/my_path"):
    for file in files:
        # open the pdf file
        object = PyPDF2.PdfFileReader(os.path.join(foldername,file))

        # get number of pages
        NumPages = object.getNumPages()

        # define keyterms
        String = "New York State Real Property Law"

        # extract text and do the search
        for i in range(0, NumPages):
            PageObj = object.getPage(i)
            print("this is page " + str(i)) 
            Text = PageObj.extractText() 
            # print(Text)
            ResSearch = re.search(String, Text)
            print(ResSearch)
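
Note that os.walk returns every file in the tree, so a folder that also holds non-PDF files would hand those to the PDF reader too. A minimal sketch of filtering by extension first, assuming the PDFs all end in .pdf (in any letter case). The function name iter_pdf_paths is hypothetical:

```python
import os


def iter_pdf_paths(root):
    """Yield the path of every .pdf file under root, in a stable order."""
    for foldername, subfolders, files in os.walk(root):
        subfolders.sort()  # visit subfolders in a predictable order
        for name in sorted(files):
            if name.lower().endswith(".pdf"):  # ignore non-PDF files
                yield os.path.join(foldername, name)
```

Each yielded path can be opened the same way as os.path.join(foldername,file) in the loop above.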

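Sorry if I have misunderstood your question.

Edit:

Unfortunately, I am not familiar with the PyPDF2 module, but it seems that something odd happens when it converts a PDF's contents to text (for example, extra newlines or formatting changes).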
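
If stray newlines are what breaks the search, one thing to try is collapsing all whitespace before matching. This is only a sketch under that assumption: it will not help if the extraction drops or reorders characters, and it expects the markers themselves to be written with single spaces. The names normalize_whitespace and extract_between are made up for illustration:

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace (newlines included) to one space."""
    return re.sub(r"\s+", " ", text).strip()


def extract_between(text, start, end):
    """Return the passages found between the literal markers start and end."""
    # re.escape makes the markers match literally, even if they contain
    # characters such as "." or "-" that are special in a regex.
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
    return [m.strip() for m in pattern.findall(normalize_whitespace(text))]
```

For the markers in the question, the call would look like extract_between(Text, "Note 1- Organization", "Note 2- Organization").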

This page may help: Extracting text from a PDF file using Python

However, if your files are .txt, a regular expression would help:

import re
import os

# match from the start phrase through the end phrase, across newlines
myRegex = re.compile(r"New York State Real Property Law.*?common elements of the property\.", re.DOTALL)
for foldername, subfolders, files in os.walk(r"C:/Users/Mirana/Me2"):
    for file in files:
        # the with block closes every file, not only the last one
        with open(os.path.join(foldername, file)) as f:
            Text = f.read()
        for subText in myRegex.findall(Text):
            print(subText)

I also adapted your PDF version, but because of the problem described above it does not work, at least for my PDFs (give it a try):

import PyPDF2
import re
import os

myRegex = re.compile(r"New York State Real Property Law.*?common elements of the property\.", re.DOTALL)
for foldername, subfolders, files in os.walk(r"C:/my_path"):
    for file in files:
        # open the pdf file
        reader = PyPDF2.PdfFileReader(os.path.join(foldername, file))

        # get number of pages
        NumPages = reader.getNumPages()

        # collect the text of every page so a match can span a page break
        Text = ""
        for i in range(0, NumPages):
            PageObj = reader.getPage(i)
            print("this is page " + str(i))
            Text += PageObj.extractText()

        # search the whole document, not only the last page
        for subText in myRegex.findall(Text):
            print(subText)
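
The question also asks for the results to be saved to a text file rather than printed. A minimal sketch of that last step, assuming the matches have already been collected into a dict that maps each PDF path to its list of passages. The function name save_passages and the "=== path ===" header format are illustrative choices, not anything from PyPDF2:

```python
def save_passages(passages_by_file, out_path):
    """Write each file's passages to out_path under a header naming the file."""
    with open(out_path, "w", encoding="utf-8") as out:
        for pdf_path, passages in passages_by_file.items():
            out.write("=== %s ===\n" % pdf_path)  # header line for this PDF
            for passage in passages:
                out.write(passage + "\n\n")  # blank line after each passage
```

In the loop above, the print(subText) call would be replaced by appending subText to the list for the current file, with a single save_passages call after the loop finishes.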

Regarding "python - Trying to loop through multiple PDF files and extract text between two search criteria", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51621692/
