python - Parsing text from a table of contents using regular expressions

Reposted. Author: 行者123. Last updated: 2023-11-30 22:20:19

Below is the text I want to parse, stored in a variable named "toc":

                                  Table of Contents
I. INTRODUCTION .................................... 1
II. FACTUAL ASPECTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
A. The Clean Air Act . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
B. EPA's Gasoline Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1. Establishment of Baselines . . . . . . . . . . . . . . . . . . . . . . . . 3
2. Reformulated Gasoline . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3. Conventional Gasoline (or "Anti-Dumping Rules") . . . . . . . . 4
C. The May 1994 Proposal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
III. MAIN ARGUMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
A. General .................................... 5
B. The General Agreement on Tariffs and Trade . . . . . . . . . . . . . . . . 6
1. Article I - General Most-Favoured-Nation Treatment . . . . . . . 6
2. Article III - National Treatment on Internal Taxation
and Regulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
a) Article III:4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
b) Article III:1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3. Article XX - General Exceptions . . . . . . . . . . . . . . . . . . . . 15
4. Article XX(b) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
a) "Protection of Human, Animal and Plant Life
or Health" . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
b) "Necessary" . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
5. Article XX(d) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
6. Article XX(g) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
a) "Related to the conservation of exhaustible natural
resources..." . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
b) "... made effective in conjunction with restrictions
on domestic production or consumption" . . . . . . . . . . 23
7. Preamble to Article XX . . . . . . . . . . . . . . . . . . . . . . . . . . 23
8. Article XXIII - Nullification and Impairment . . . . . . . . . . . . 25

I want a result like this:

['I.INTRODUCTION ...... 1', 'A. The Clean Air Act ....3', 'B. EPA\'s Gasoline Rule ... 3', (AND_SO_ON) ]

Input:

re.search(r"((?<=(\n))\s+(?P<name>[A-Z \.]*?)(\n))", toc_s).group() 

Output:

---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-64-4aa240f6e378> in <module>()
----> 1 re.search(r"((?<=(\n))\s+(?P<name>[A-Z \.]*?)(\n))", toc_s).group()

AttributeError: 'NoneType' object has no attribute 'group'

What am I doing wrong?

Best answer

Assume the entire TOC is in a multi-line string named text. You can use re.findall or re.finditer with the re.MULTILINE flag enabled.

import re

for match in re.finditer(r'(.*?)[\W]+(\d+)(?=\n|$)', text, flags=re.M):
    # strip surrounding whitespace from both captured groups
    chapter, page = map(str.strip, match.groups())
    ...  # do something with these

Alternatively,

contents = re.findall(r'(.*?)[\W]+(\d+)(?=\n|$)', text, flags=re.M)

This returns something like:

[('I.   INTRODUCTION', '1'),
('II. FACTUAL ASPECTS', '2'),
(' A. The Clean Air Act', '3'),
(" B. EPA's Gasoline Rule", '3'),
(' 1. Establishment of Baselines', '3'),
(' 2. Reformulated Gasoline', '4'),
...
]

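This is a list of 2-tuples. Each tuple holds a) the chapter and b) the corresponding page number. If a line does not match the pattern, it is simply ignored.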
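As a rough, self-contained sketch of the re.findall call above, here it is run on a shortened sample of the TOC. The sample string `sample` and the helper name `parse_toc` are assumptions made for illustration and are not part of the original answer.

```python
import re

# Same pattern as in the answer, written as a raw string.
PATTERN = re.compile(r'(.*?)[\W]+(\d+)(?=\n|$)', flags=re.M)


def parse_toc(text):
    """Return (chapter, page) tuples with surrounding whitespace stripped."""
    return [(chapter.strip(), page) for chapter, page in PATTERN.findall(text)]


# Shortened sample (an assumption for illustration); the last entry wraps
# onto a second line, like "Article III" in the full TOC.
sample = (
    "Table of Contents\n"
    "I. INTRODUCTION .................................... 1\n"
    "   A. The Clean Air Act . . . . . . . . . . . . . . . 3\n"
    "   2. Article III - National Treatment on Internal Taxation\n"
    "      and Regulation . . . . . . . . . . . . . . . . 7\n"
)

entries = parse_toc(sample)
for chapter, page in entries:
    print(f"{chapter} ... {page}")
```

With this sample, the heading line is skipped because it has no page number. The wrapped entry contributes only its second line ("and Regulation"), since `.` does not match a newline. If the full title matters, wrapped lines would need to be joined before matching.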

Details

The pattern is quite specific and took some trial and error.

(           # first capture group - the chapter name
    .*?     # non-greedy match
)
[\W]+       # one or more non-word characters (anything but letters, digits, underscore)
(           # second capture group - the page number
    \d+     # one or more digits
)
(?=         # lookahead for a newline or EOL (multiline)
    \n      # literal newline
    |       # regex OR
    $       # EOL
)
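The breakdown above maps almost directly onto Python's re.VERBOSE mode, which ignores whitespace and `#` comments inside the pattern. The following is a minimal sketch of that idea; the names `TOC_ENTRY` and `line` are illustrative assumptions, and the sample line is taken from the TOC in the question.

```python
import re

# The answer's pattern in re.VERBOSE form; whitespace and "#" comments
# inside the pattern are ignored by the regex engine.
TOC_ENTRY = re.compile(
    r"""
    (.*?)       # first capture group: the chapter name (non-greedy)
    [\W]+       # separator: one or more non-word characters (dots, spaces)
    (\d+)       # second capture group: the page number
    (?=\n|$)    # lookahead: newline or end of line (with re.MULTILINE)
    """,
    flags=re.MULTILINE | re.VERBOSE,
)

line = "B. EPA's Gasoline Rule . . . . . . . . 3"  # sample line from the TOC
match = TOC_ENTRY.search(line)
if match is not None:  # re.search returns None when nothing matches
    chapter, page = match.groups()
    print(chapter, page)
```

The `is not None` check matters: re.search returns None when the pattern does not match, and calling .group() on that None is what raises the AttributeError shown in the question.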

Regarding python - Parsing text from a table of contents using regular expressions, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/48878848/
