python - Difference between re.findall and re.finditer: a bug in the Python 2.7 re module?

Reposted. Author: 行者123. Updated: 2023-11-28 22:52:25

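While demonstrating Python's regular expression features, I wrote a small program to compare the return values of re.search(), re.findall(), and re.finditer(). I know that re.search() finds only one match per line, and that re.findall() returns only the matched substrings without any position information. To my surprise, however, the matched substrings themselves can differ among the three functions.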

The code (available on GitHub):

#! /usr/bin/env python
# -*- coding: utf-8 -*-

# License: CC-BY-NC-SA 3.0

import re
import codecs

# download kate_chopin_the_awakening_and_other_short_stories.txt
# from Project Gutenberg:
# http://www.gutenberg.org/ebooks/160.txt.utf-8
# with wget:
# wget http://www.gutenberg.org/ebooks/160.txt.utf-8 -O kate_chopin_the_awakening_and_other_short_stories.txt


# match for something o'clock, with valid numerical time or
# any English word with proper capitalization

oclock = re.compile(r"""
    (
        [A-Z]?[a-z]+    # word with at most one capital letter
        | 1[012]        # 10,11,12
        | [1-9]         # 1,2,3,4,5,6,7,8,9
    )
    \s
    o'clock""",
    re.VERBOSE)

path = "kate_chopin_the_awakening_and_other_short_stories.txt"

print
print "re.search()"
print
print u"{:>6} {:>6} {:>6}\t{}".format("Line","Start","End","Match")
print u"{:=>6} {:=>6} {:=>6}\t{}".format('','','','=====')

with codecs.open(path,mode='r',encoding='utf-8') as f:
    for lineno, line in enumerate(f):
        atime = oclock.search(line)
        if atime:
            print u"{:>6} {:>6} {:>6}\t{}".format(lineno,
                                                  atime.start(),
                                                  atime.end(),
                                                  atime.group())


print
print "re.findall()"
print
print u"{:>6} {:>6} {:>6}\t{}".format("Line","Start","End","Match")
print u"{:=>6} {:=>6} {:=>6}\t{}".format('','','','=====')
with codecs.open(path,mode='r',encoding='utf-8') as f:
    for lineno, line in enumerate(f):
        times = oclock.findall(line)
        if times:
            print u"{:>6} {:>6} {:>6}\t{}".format(lineno,
                                                  '',
                                                  '',
                                                  ' '.join(times))


print
print "re.finditer()"
print
print u"{:>6} {:>6} {:>6}\t{}".format("Line","Start","End","Match")
print u"{:=>6} {:=>6} {:=>6}\t{}".format('','','','=====')
with codecs.open(path,mode='r',encoding='utf-8') as f:
    for lineno, line in enumerate(f):
        times = oclock.finditer(line)
        for m in times:
            print u"{:>6} {:>6} {:>6}\t{}".format(lineno,
                                                  m.start(),
                                                  m.end(),
                                                  m.group())

And the output (tested on Python 2.7.3 and 2.7.5):

re.search()

  Line  Start    End    Match
====== ====== ======    =====
   248      7     21    eleven o'clock
  1520     24     35    one o'clock
  1975     21     33    nine o'clock
  2106      4     16    four o'clock
  4443     19     30    ten o'clock

re.findall()

  Line  Start    End    Match
====== ====== ======    =====
   248                  eleven
  1520                  one
  1975                  nine
  2106                  four
  4443                  ten

re.finditer()

  Line  Start    End    Match
====== ====== ======    =====
   248      7     21    eleven o'clock
  1520     24     35    one o'clock
  1975     21     33    nine o'clock
  2106      4     16    four o'clock
  4443     19     30    ten o'clock

What am I missing here? Why doesn't re.findall() return the o'clock part?

Best answer

According to the re.findall documentation:

... If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.

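Your pattern contains exactly one group, so findall returns a list of what that group captured rather than the whole match. The o'clock part lies outside the parentheses, which is why it is missing from the findall output.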


>>> import re
>>> re.findall('abc', 'abc')
['abc']
>>> re.findall('a(b)c', 'abc')
['b']
>>> re.findall('a(b)(c)', 'abc')
[('b', 'c')]
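
To tie this back to the three functions in the question: on a match object, group() (or group(0)) is the entire match, while group(1) is the first capturing group. The sketch below uses a single-line version of the question's pattern and a made-up sample sentence (not a line from the Project Gutenberg file) to suggest that findall with one group yields the same strings as collecting group(1) from finditer.

```python
import re

# Single-line version of the question's pattern, still with one capturing group.
oclock = re.compile(r"([A-Z]?[a-z]+|1[012]|[1-9])\so'clock")

# Hypothetical sample text, invented for illustration.
line = "They left at eleven o'clock and came back at 4 o'clock."

whole_matches = [m.group(0) for m in oclock.finditer(line)]  # entire match
group_matches = [m.group(1) for m in oclock.finditer(line)]  # first group only
findall_result = oclock.findall(line)                        # groups, not matches

print(whole_matches)   # ["eleven o'clock", "4 o'clock"]
print(group_matches)   # ['eleven', '4']
print(findall_result)  # ['eleven', '4']
```

So the three functions agree on where the matches are; findall simply reports the group instead of the full match.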

Use the non-capturing version of the parentheses:

>>> re.findall('a(?:b)c', 'abc')
['abc']
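
Applied to the pattern from the question, the same fix might look like the following sketch. The pattern is the question's, with the group changed to (?:...); the sample sentence is invented for illustration and is not taken from the book.

```python
import re

oclock = re.compile(r"""
    (?:                 # non-capturing group: used for alternation only
        [A-Z]?[a-z]+    # word with at most one capital letter
        | 1[012]        # 10, 11, 12
        | [1-9]         # 1 through 9
    )
    \s
    o'clock""",
    re.VERBOSE)

# Hypothetical sample text, invented for illustration.
line = "Dinner was at seven o'clock; the guests arrived by 10 o'clock."

full = oclock.findall(line)
print(full)  # ["seven o'clock", "10 o'clock"]
```

With no capturing groups left in the pattern, findall falls back to returning the whole match. If the captured word is needed as well, one alternative is to keep the inner group and wrap the entire pattern in an outer group, in which case findall returns a list of tuples.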

This is a repost of a question about the difference between re.findall and re.finditer in Python; the original can be found on Stack Overflow: https://stackoverflow.com/questions/20424661/
