python - How to optimize combining 2 lists of tuples and removing their duplicates?

Reposted. Author: 行者123. Updated: 2023-11-28 22:56:41

Following on from "How do I remove element from a list of tuple if the 2nd item in each tuple is a duplicate?", I was able to remove the tuples with a duplicate second element from 1 list of tuples.

Say I have 2 lists of tuples:

alist = [(0.7897897,'this is a foo bar sentence'),
(0.653234, 'this is a foo bar sentence'),
(0.353234, 'this is a foo bar sentence'),
(0.325345, 'this is not really a foo bar'),
(0.323234, 'this is a foo bar sentence'),]

blist = [(0.64637,'this is a foo bar sentence'),
(0.534234, 'i am going to foo bar this sentence'),
(0.453234, 'this is a foo bar sentence'),
(0.323445, 'this is not really a foo bar')]

I need to combine the scores where the second element is the same (score_from_alist * score_from_blist) and get this desired output:

clist = [(0.51,'this is a foo bar sentence'), # 0.51 = 0.789 * 0.646
(0.105, 'this is not really a foo bar')] # 0.105 = 0.325 * 0.323

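Currently I get clist with the function below, but it takes more than 5 seconds when alist and blist each have around 5500+ tuples, where the second element of each tuple is about 20-40 words long. Is there a way to make the function below faster?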

import time

def overlapMatches(alist, blist):
    start_time = time.time()
    clist = []
    overlap = set()
    # Compare every tuple in alist with every tuple in blist (quadratic).
    for d in alist:
        for dn in blist:
            if d[1] == dn[1]:
                score = d[0]*dn[0]
                overlap.add((score,d[1]))
    # Keep the 20 highest-scoring matches.
    for s in sorted(overlap, reverse=True)[:20]:
        clist.append((s[0],s[1]))
    print("overlapping matches takes", time.time() - start_time)
    return clist

Best answer

I would use a dict/set to eliminate the duplicates and provide fast lookup:

alist = [(0.7897897,'this is a foo bar sentence'),
(0.653234, 'this is a foo bar sentence'),
(0.353234, 'this is a foo bar sentence'),
(0.325345, 'this is not really a foo bar'),
(0.323234, 'this is a foo bar sentence'),]

blist = [(0.64637,'this is a foo bar sentence'),
(0.534234, 'i am going to foo bar this sentence'),
(0.453234, 'this is a foo bar sentence'),
(0.323445, 'this is not really a foo bar')]

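# Map sentence -> score; iterating in reverse lets the first occurrence win.
bdict = {k:v for v,k in reversed(blist)}
clist = []
cset = set()  # sentences already emitted
for v,k in alist:
    if k not in cset:
        b = bdict.get(k, None)
        if b is not None:
            clist.append((v * b, k))
            cset.add(k)
print(clist)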

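Here, bdict keeps only the first occurrence of each sentence in blist (it is built from reversed(blist), so earlier entries overwrite later ones) and provides fast lookup by sentence. cset does the same job for alist: only the first occurrence of each sentence gets a product.

If you don't care about the order of clist, you can simplify the structure a bit: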
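Note that the answer's snippet returns every match, while the question's function also sorts and keeps the top 20. Here is a minimal sketch of how the two could be put together. The function name `overlap_matches`, the `top_n` parameter, and the use of `heapq.nlargest` are my own assumptions for illustration and are not part of the original answer:

```python
import heapq

def overlap_matches(alist, blist, top_n=20):
    """Multiply the first score seen per sentence in each list; return the top_n products."""
    # First occurrence wins: reversing makes earlier entries overwrite later ones.
    first_b = {sentence: score for score, sentence in reversed(blist)}
    products = {}
    for score, sentence in alist:
        # Skip sentences missing from blist and sentences already handled.
        if sentence in first_b and sentence not in products:
            products[sentence] = score * first_b[sentence]
    # nlargest avoids a full sort when only a few results are needed.
    return heapq.nlargest(top_n, ((p, s) for s, p in products.items()))

alist = [(0.7897897, 'this is a foo bar sentence'),
         (0.653234, 'this is a foo bar sentence'),
         (0.353234, 'this is a foo bar sentence'),
         (0.325345, 'this is not really a foo bar'),
         (0.323234, 'this is a foo bar sentence')]

blist = [(0.64637, 'this is a foo bar sentence'),
         (0.534234, 'i am going to foo bar this sentence'),
         (0.453234, 'this is a foo bar sentence'),
         (0.323445, 'this is not really a foo bar')]

for score, sentence in overlap_matches(alist, blist):
    print(round(score, 3), sentence)
```

Unlike the question's nested loop, which adds one product for every matching pair, this keeps a single product per sentence, which is what the desired clist in the question shows.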

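bdict = {k:v for v,k in reversed(blist)}
cdict = {}  # sentence -> product of the first scores
for v,k in alist:
    if k not in cdict:
        b = bdict.get(k, None)
        if b is not None:
            cdict[k] = v * b
# cdict maps sentence -> score, so swap back to (score, sentence) tuples.
print([(v, k) for k, v in cdict.items()])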
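The speedup comes from the amount of work done: the nested loop performs len(alist) * len(blist) string comparisons, while the dict version builds one dict and then does one average constant-time lookup per tuple in alist. A rough way to see this on your own machine is sketched below. The synthetic data, the list sizes, and both helper function names are made up for illustration, and the timings will vary from machine to machine:

```python
import random
import time

def nested_loop_sentences(alist, blist):
    # Quadratic approach from the question: compare every pair of tuples.
    return {a[1] for a in alist for b in blist if a[1] == b[1]}

def dict_lookup_sentences(alist, blist):
    # Approach from the answer: one dict build, then one pass over alist.
    bdict = {k: v for v, k in reversed(blist)}
    return {k for v, k in alist if k in bdict}

random.seed(0)  # fixed seed so repeated runs use the same synthetic data
vocab = ["sentence %d" % i for i in range(500)]
alist = [(random.random(), random.choice(vocab)) for _ in range(1000)]
blist = [(random.random(), random.choice(vocab)) for _ in range(1000)]

for func in (nested_loop_sentences, dict_lookup_sentences):
    start = time.perf_counter()
    shared = func(alist, blist)
    elapsed = time.perf_counter() - start
    print(func.__name__, len(shared), "shared sentences,", elapsed, "s")
```

Both helpers find the same set of shared sentences; only the time taken to find them differs.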

Regarding "python - How to optimize combining 2 lists of tuples and removing their duplicates?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/15183931/
