So, I have a huge input file that looks like this (you can download it here):
1. FLO8;PRI2
2. FLO8;EHD3
3. GRI2;BET2
4. HAL4;AAD3
5. PRI2;EHD3
6. QLN3;FZF1
7. QLN3;ABR5
8. FZF1;ABR5
...
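Think of it as a two-column table, where the element before the ";" points to the element after the ";".
I want to iteratively print simple strings showing the three elements that form a feedforward loop. The numbered example list above would output: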
"FLO8 PRI2 EHD3"
"QLN3 FZF1 ABR5"
...
Reading the first output line as a feedforward loop:
A -> B (FLO8;PRI2)
B -> C (PRI2;EHD3)
A -> C (FLO8;EHD3)
Only the circled one from this link.
So, I have this, but it is very slow... Any suggestions for speeding up the implementation?
import csv

TF = []
TAR = []

# READING THE FILE
with open("MYFILE.tsv") as tsv:
    for line in csv.reader(tsv, delimiter=";"):
        TF.append(line[0])
        TAR.append(line[1])

# I WANT A BETTER WAY TO RUN THIS.. All these for loops are killing me
for i in range(len(TAR)):
    for j in range(len(TAR)):
        if (TAR[j] != TF[j] and TAR[i] != TF[i] and TAR[i] != TAR[j] and TF[j] == TF[i]):
            for k in range(len(TAR)):
                if (not (k == i or k == j) and TF[k] == TAR[j] and TAR[k] == TAR[i]):
                    print("FFL: " + TF[i] + " " + TAR[j] + " " + TAR[i])
Note: I don't want self-loops... from A -> A, B -> B, or C -> C.
I use a dictionary of sets to get very fast lookups, like so:
Edit: prevented self-loops:
from collections import defaultdict

INPUT = "RegulationTwoColumnTable_Documented_2013927.tsv"

# load the data as { "ABF1": set(["ABF1", "ACS1", "ADE5,7", ... ]) }
data = defaultdict(set)
with open(INPUT) as inf:
    for line in inf:
        a, b = line.rstrip().split(";")
        if a != b:  # no self-loops
            data[a].add(b)

# find all triplets such that A -> B -> C and A -> C
found = []
# list(): data[b] can insert a new key into the defaultdict, which
# Python 3 does not allow while iterating over data.items() directly
for a, bs in list(data.items()):
    bint = bs.intersection
    for b in bs:
        for c in bint(data[b]):
            found.append("{} {} {}".format(a, b, c))
On my machine, this loads the data in 0.36 seconds and finds 1,933,493 solutions in 2.90 seconds; the results look like
['ABF1 ADR1 AAC1',
'ABF1 ADR1 ACC1',
'ABF1 ADR1 ACH1',
'ABF1 ADR1 ACO1',
'ABF1 ADR1 ACS1',
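As a quick sanity check, here is a minimal sketch that applies the same dictionary-of-sets idea to the eight sample lines from the question. The names `SAMPLE`, `build_graph`, and `find_ffls` are made up for this illustration and are not part of the code above. Assuming I have read the question correctly, it should print the two loops listed there.

```python
from collections import defaultdict

# The eight example edges from the question, one "A;B" pair per line.
SAMPLE = """FLO8;PRI2
FLO8;EHD3
GRI2;BET2
HAL4;AAD3
PRI2;EHD3
QLN3;FZF1
QLN3;ABR5
FZF1;ABR5"""


def build_graph(lines):
    """Return {source: set(targets)}, skipping blank lines and self-loops."""
    graph = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        a, b = line.split(";")
        if a != b:  # no self-loops
            graph[a].add(b)
    return graph


def find_ffls(graph):
    """Return sorted (a, b, c) triples with a -> b, b -> c and a -> c."""
    found = []
    for a, bs in graph.items():
        for b in bs:
            # .get() does not insert missing keys, so the dict is never
            # modified while we iterate over it
            for c in bs & graph.get(b, set()):
                found.append((a, b, c))
    return sorted(found)


ffls = find_ffls(build_graph(SAMPLE.splitlines()))
for a, b, c in ffls:
    print("{} {} {}".format(a, b, c))
```

Returning tuples rather than formatted strings is just a choice for this sketch; it keeps the triples easy to sort and compare, and the formatting happens only when printing.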
Edit2: Not sure this is what you want, but if you need A -> B and A -> C and B -> C but not B -> A or C -> A or C -> B, you could try
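found = []
# list(): data[b] and data[c] can insert new keys into the defaultdict,
# which Python 3 does not allow while iterating over data.items() directly
for a, bs in list(data.items()):
    bint = bs.intersection
    for b in bs:
        if a not in data[b]:
            for c in bint(data[b]):
                if a not in data[c] and b not in data[c]:
                    found.append("{} {} {}".format(a, b, c))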
But this still returns 1,380,846 solutions.
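To illustrate what the stricter filter does, here is a small sketch on a made-up toy graph. The node names `A`, `B`, `C` and the helper `strict_ffls` are hypothetical and only serve this example. It suggests that a single back edge such as C -> A is enough to drop a triple.

```python
def strict_ffls(graph):
    """Triples with a -> b, b -> c, a -> c and none of the reverse edges."""
    empty = frozenset()
    found = []
    for a, bs in graph.items():
        for b in bs:
            b_targets = graph.get(b, empty)
            if a in b_targets:  # reverse edge b -> a exists, skip
                continue
            for c in bs & b_targets:
                c_targets = graph.get(c, empty)
                # reject if c -> a or c -> b exists
                if a not in c_targets and b not in c_targets:
                    found.append((a, b, c))
    return sorted(found)


# Hypothetical toy graphs: a plain feedforward loop ...
plain = {"A": {"B", "C"}, "B": {"C"}}
# ... and the same loop with one back edge C -> A added.
with_back_edge = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}

print(strict_ffls(plain))
print(strict_ffls(with_back_edge))
```

The sketch uses plain dicts with `.get()` instead of a `defaultdict`, so lookups for nodes without outgoing edges never add keys during iteration.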