
Python: finding the part of a file with the most regular expression matches

Reposted. Author: 太空宇宙. Updated: 2023-11-04 02:24:12

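I read a file line by line and check where a code section ends, which is marked by a specific character sequence. That sequence can also occur inside the code section, so I have to check for redundancy: how many consecutive lines contain the sequence. Once it appears on 10 consecutive lines, I should return the first line of that run, which is where the end of the code section is detected.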

import logging
import re

regexp_dict_02 = {'Name': 'EMPTY_PAGES', 'Expr': '(FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF)'}

def FindEmptyPg(Inpfile, Section):
    NbrLine = []  # indices of every line that matches the pattern
    Pos = []
    flag = 0
    index = 0
    Ln = 0

    with open(Inpfile) as fp:
        for i, line in enumerate(fp):
            if i >= Section.startline and i < 30061:
                s = re.search(regexp_dict_02['Expr'], line)
                if s:
                    NbrLine.append(i)

    logging.info(NbrLine)
    logging.info(len(NbrLine))
    # Count pairs of adjacent matching lines; flag is never reset on a gap
    for index in range(len(NbrLine) - 1):
        if NbrLine[index+1] - NbrLine[index] == 1:
            logging.info(str(NbrLine[index+1]) + ' ' + str(NbrLine[index]))
            Pos.append(index)
            flag += 1
            if flag == 5:
                Ln = NbrLine[Pos[0]]
                break
    logging.info(Pos)
    return Ln


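In the code above I only compare pairs of consecutive lines, and I get the wrong line number. I am trying to avoid anything complicated such as a state machine, but I am still stuck.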

Best answer

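Here is one solution. The code below iterates over every line. Each time a match is found, the line index is appended to block. As soon as a line without a match is reached, the block is considered "closed" and a new empty block is started, but before that, the length of the block and its first index are saved in results. That is the only information you are interested in. Finally, results is sorted and its last item is selected (a list of tuples is sorted by the first element of each tuple first, which here is the length of the block). That item is a tuple holding the length of the longest block found and the index of that block's first line.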

t = \
'''
000010000000000000000000000000000000000011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
000010000000000000000000000000000000000011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
000010000000000000000000000000000000000011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
000010000000000000000000000000000000000011111
000010000000000000000000000000000000000011111
000010000000000000000000000000000000000011111
'''

pattern = 'FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF'
block = []    # line indices of the block currently being collected
results = []  # one (length, first index) tuple per closed block
for i, line in enumerate(t.split('\n')):
    if pattern in line:
        block.append(i)
    elif block:
        results.append((len(block), block[0]))  # save the length and the first index of each block
        block = []
if block:  # close a block that runs up to the last line
    results.append((len(block), block[0]))

cons, index = sorted(results)[-1]  # number of consecutive matches, line index
print(f'max consecutive matches found: {cons} , starting at line {index}')

Output:

max consecutive matches found: 14 , starting at line 11
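
As a variation on the same idea, the grouping of consecutive matching lines can be left to itertools.groupby from the standard library. This is only a minimal sketch: the function name longest_run and the short sample list are made up for illustration and are not part of the original answer.

```python
from itertools import groupby

def longest_run(lines, pattern):
    """Return (length, first_index) of the longest run of consecutive
    lines containing `pattern`, or (0, None) if no line matches."""
    best = (0, None)
    # Pair every line index with a flag saying whether the line matches.
    flagged = ((i, pattern in line) for i, line in enumerate(lines))
    # groupby collects adjacent items that share the same flag value.
    for is_match, group in groupby(flagged, key=lambda pair: pair[1]):
        if not is_match:
            continue
        indices = [i for i, _ in group]
        # Strict '>' keeps the earliest run when two runs tie in length.
        if len(indices) > best[0]:
            best = (len(indices), indices[0])
    return best

# Hypothetical sample: runs at indices 1-2 and 4-6.
sample = ['0000', 'xFFx', 'xFFx', '0000', 'xFFx', 'xFFx', 'xFFx']
print(longest_run(sample, 'FF'))  # (3, 4)
```

Two things differ from the loop above: a run that reaches the last line is counted without any extra handling, and on a tie this sketch keeps the earliest run, whereas sorted(results)[-1] picks the latest one.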

Addressing the comment:

I need the first sufficient successive occurrences: first 10 successive occurrences matched then I catch the line.

You can use the following code instead.

pattern = 'FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF'
block = []
for i, line in enumerate(t.split('\n')):
    if pattern in line:
        block.append(i)
    else:
        # A non-matching line closes the current block
        if len(block) >= 10:
            print(f'found a block of at least 10 lines starting from line {block[0]}')
            break
        block = []
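
Note that the loop above only reports a block once a non-matching line follows it, so a block that runs to the end of the input is not reported. For the original use case (reading a file and starting from a given line), the same check can be sketched as a function that returns as soon as the run is long enough. The name first_run_start and its parameters needed and start are assumptions introduced for this illustration.

```python
import re

def first_run_start(lines, expr, needed=10, start=0):
    """Return the index of the first line of the first run of at least
    `needed` consecutive lines matching `expr`, or None if there is none.
    Lines before index `start` are ignored."""
    regex = re.compile(expr)
    run_start = None  # index where the current run began
    run_len = 0       # number of consecutive matches seen so far
    for i, line in enumerate(lines):
        if i < start:
            continue
        if regex.search(line):
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len >= needed:
                return run_start  # stop as soon as the run is long enough
        else:
            run_len = 0  # a non-matching line breaks the run
    return None

# Hypothetical sample: a run of 2 at index 1 and a run of 3 at index 4.
sample = ['00', 'FF', 'FF', '00', 'FF', 'FF', 'FF', '00']
print(first_run_start(sample, 'FF', needed=3))  # 4
```

Because the function accepts any iterable of lines, an open file object can be passed directly, for example first_run_start(fp, regexp_dict_02['Expr'], needed=10, start=Section.startline) inside a with open(Inpfile) as fp: block, assuming the names from the question. The file is then read lazily and reading stops at the tenth consecutive match.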

Regarding finding the part of a file with the most regular expression matches in Python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50874061/
