
python - How to split PDF text by numbering

Reposted. Author: 行者123. Updated: 2023-12-01 07:35:50

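My problem is not with the PDF extraction itself. Suppose this is an excerpt of text from a PDF:

(a) This is my first paragraph, which is some junk text

(b) This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d) somewhere within this text

(c) This again is is some third paragraph

I am trying to create a list of 3 values, one for each paragraph.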

import re
entire_text = """(a) This is my first paragraph, which is some junk text

(b) This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d) somewhere within this text

(c) This again is is some third paragraph"""
PDF_SUB_SECTIONS = ["(a) ", "(b) ", "(c) ", "(d) ", "(e) ", "(f) ", "(g) "]
regexPattern = '|'.join(map(re.escape,PDF_SUB_SECTIONS))
glSubSections = re.split(regexPattern, entire_text)

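What I expect is: ['This is my first paragraph, which is some junk text', 'This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d) somewhere within this text', 'This again is is some third paragraph']

What I get is: ['This is my first paragraph, which is some junk text', 'This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945', 'somewhere within this text', 'This again is is some third paragraph']

More info: 1) In "clause 945(d)" there will never be a gap between "945" (or any other text) and "(d". 2) I am using PyPDF2 to extract the text above.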

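Best answer

There are several ways to do this with regular expressions, but they tend to get complicated, and a regex may not be the best approach. As an example, you could use an expression similar to: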

^(?:\([^)]+\))\s*(.*)
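To make the parts of this expression easier to follow, here is the same pattern written with re.VERBOSE so each piece can carry a comment. This is only an annotated sketch: the name MARKER_LINE and the short sample string are made up for illustration and are not part of the original answer.

```python
import re

# Same pattern as above, rewritten with re.VERBOSE so each part can be commented.
MARKER_LINE = re.compile(
    r"""
    ^               # start of a line (requires re.MULTILINE)
    (?:\([^)]+\))   # a parenthesised marker such as (a) or (b); not captured
    \s*             # optional whitespace after the marker
    (.*)            # the rest of the line, captured as group 1
    """,
    re.MULTILINE | re.VERBOSE,
)

# Hypothetical sample text, shortened from the question's excerpt.
sample = "(a) first item\n\n(b) mentions clause 945(d) in passing"
paragraphs = MARKER_LINE.findall(sample)
print(paragraphs)
```

Because the marker is anchored with ^, a "(d)" that appears in the middle of a line is never treated as the start of a paragraph. Note that (.*) stops at the end of the line, so this only captures paragraphs that sit on a single line.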

Test with re.findall

import re

regex = r"^(?:\([^)]+\))\s*(.*)"

test_str = ("(a) This is my first paragraph, which is some junk text\n\n"
"(b) This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d)\n\n"
"(c) This again is is some third paragraph")

print(re.findall(regex, test_str, re.MULTILINE))

Output

['This is my first paragraph, which is some junk text', 'This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d)', 'This again is is some third paragraph']

Test with re.sub

import re

regex = r"^(?:\([^)]+\))\s*(.*)"

test_str = ("(a) This is my first paragraph, which is some junk text\n\n"
"(b) This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d)\n\n"
"(c) This again is is some third paragraph")

subst = "\\1"

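# Pass count and flags by keyword; passing them positionally is deprecated in recent Python versions.
print(re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE))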

Test with re.finditer

import re

regex = r"^(?:\([^)]+\))\s*(.*)"

test_str = ("(a) This is my first paragraph, which is some junk text\n\n"
"(b) This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d)\n\n"
"(c) This again is is some third paragraph")

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):
    # Report the span and text of the whole match.
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))

    # Capture groups are numbered from 1.
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))

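The expression is explained in the top-right panel of this demo, if you wish to explore, simplify, or modify it. In this link you can also watch, step by step, how it matches against some sample inputs.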
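If you would rather keep the re.split approach from the question, one option is to anchor the marker to the start of a line. The sketch below is an alternative to the expression above, not the answerer's method. The helper name split_subsections is hypothetical, and it assumes that every sub-section marker is a single lowercase letter in parentheses at the beginning of a line.

```python
import re

def split_subsections(text):
    """Split text on markers like "(a) " that begin a line."""
    # ^ with re.MULTILINE only matches at line starts, so "945(d) " is skipped.
    parts = re.split(r"^\([a-z]\)\s+", text, flags=re.MULTILINE)
    # Drop the empty piece before the first marker and trim blank lines.
    return [part.strip() for part in parts if part.strip()]

entire_text = """(a) This is my first paragraph, which is some junk text

(b) This is another paragraph, but it incidentally has some reference to another paragraph which refers to clause 945(d) somewhere within this text

(c) This again is is some third paragraph"""

sections = split_subsections(entire_text)
print(len(sections))  # 3
```

Since the split keeps everything between two markers, a paragraph that wraps over several lines stays in one piece. The caveat is that a wrapped line which happens to begin with something like "(d) " would still be treated as a new sub-section.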

RegEx Circuit

jex.im visualizes regular expressions.

This post is based on a similar question found on Stack Overflow, "python - How to split PDF text by numbering": https://stackoverflow.com/questions/56996407/
