python - Finding duplicate lines; a faster way

Reposted by: 行者123 Updated: 2023-11-28 22:37:47

This is how I find all duplicate lines in a text file:

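import regex  # third-party module, used here the same way as re
# read all lines into a buffer (f is an already opened text file)
r = f.readlines()
# create a list of all line numbers (endline is the number of the last line)
lines = list(range(1, endline + 1))
# pair each line with its line number
z = [list(a) for a in zip(r, lines)]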

# sort the pairs so that identical lines end up next to each other
newsorting = sorted(z)

# collect the line numbers of duplicates in a list
listdoubles = []
for i in range(0, len(newsorting) - 1):
    if (i+1) <= len(newsorting):
        # equal neighbours are duplicates; blank lines are ignored
        if (newsorting[i][0] == newsorting[i+1][0]) and (not regex.search(r'^\s*$', newsorting[i][0])):
            listdoubles.append(newsorting[i][1])
            listdoubles.append(newsorting[i+1][1])

# remove any repeated line numbers
listdoubles = list(set(listdoubles))
# sort the line numbers numerically
listdoubles = sorted(listdoubles, key=int)
print(listdoubles)

But it is slow. With more than 10,000 lines, building this list takes 10 seconds.

Is there a way to do this faster?

Best answer

You can use a simpler approach:

for each line
    if it has been seen before, display it
    otherwise add it to the set of known lines

In code:

seen = set()
for L in f:
    if L in seen:
        print(L)
    else:
        seen.add(L)

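If you want to display the line numbers where duplicates occur, the code can simply be changed to use a dictionary that maps each line's content to the number of the line where that text was first seen: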

seen = {}
for n, L in enumerate(f, 1):  # count from 1, like the line numbers in the question
    if L in seen:
        print("Line %i is a duplicate of line %i" % (n, seen[L]))
    else:
        seen[L] = n
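
The question asks for a sorted list of the numbers of every line that occurs more than once, with blank lines ignored. The same hashing idea can produce that output by grouping line numbers per line content. What follows is only a minimal sketch and not part of the original answer; the function name `duplicate_line_numbers` and the sample input are illustrative assumptions.

```python
from collections import defaultdict
import io

def duplicate_line_numbers(f):
    """Return the sorted 1-based numbers of all non-blank lines that occur more than once."""
    positions = defaultdict(list)  # line content -> line numbers where it occurs
    for n, line in enumerate(f, 1):
        if line.strip():  # skip empty or whitespace-only lines, as the question does
            positions[line].append(n)
    # keep only contents seen at least twice and flatten their line numbers
    return sorted(n for nums in positions.values() if len(nums) > 1 for n in nums)

# hypothetical sample input standing in for an opened text file
sample = io.StringIO("a\nb\n\na\nc\n\nb\na\n")
print(duplicate_line_numbers(sample))  # [1, 2, 4, 7, 8]
```

This makes a single pass over the input plus one sort of the result, so no sorting of the full line contents is needed.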

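Both dict and set in Python are hash-based and provide lookups in constant time on average.

Edit

If you only need the line number of the last duplicate of a line, the output clearly cannot be produced during processing; you have to process the entire input before emitting any output...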
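
To illustrate why the hash-based lookup matters, the rough sketch below compares a membership test against a list (which scans element by element) with the same test against a set. It uses the standard `timeit` module; the sizes and names are arbitrary assumptions, and the actual timings depend on the machine.

```python
import timeit

n = 10_000
items_list = [f"line {i}\n" for i in range(n)]
items_set = set(items_list)
probe = f"line {n - 1}\n"  # worst case for the list: the last element

# run each membership test 1,000 times and report the total time
list_time = timeit.timeit(lambda: probe in items_list, number=1_000)
set_time = timeit.timeit(lambda: probe in items_set, number=1_000)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```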

# lastdup maps line content to the line number where its last
# duplicate was found. On first insertion the value is None,
# which marks the line as not (yet) a duplicate
lastdup = {}
for n, L in enumerate(f, 1):  # count from 1, like the line numbers in the question
    if L in lastdup:
        lastdup[L] = n
    else:
        lastdup[L] = None

# Now all values that are not None are the last duplicate of a line
result = sorted(x for x in lastdup.values() if x is not None)
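
As a quick check of the snippet above, here is the same logic wrapped in a function and run against a small in-memory input. The function name `last_duplicate_numbers`, the 1-based numbering, and the sample text are assumptions made for illustration.

```python
import io

def last_duplicate_numbers(f):
    """Return the sorted 1-based numbers of the last repeat of each duplicated line."""
    lastdup = {}
    for n, line in enumerate(f, 1):
        # None on first sight; overwritten by the line number on every repeat
        lastdup[line] = n if line in lastdup else None
    return sorted(x for x in lastdup.values() if x is not None)

# hypothetical sample: "a" occurs on lines 1, 3, 6 and "b" on lines 2, 5
sample = io.StringIO("a\nb\na\nc\nb\na\n")
print(last_duplicate_numbers(sample))  # [5, 6]
```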

Regarding python - Finding duplicate lines; a faster way, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/36233149/
