
python - Forming bigrams of words in a list of sentences and counting them with Python

Reposted. Author: 太空宇宙. Updated: 2023-11-03 14:40:42

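I need to: 1. form bigram pairs and store them in a list; 2. collect the IDs of the messages in which the 3 most frequent bigrams occur.

I have a list of sentences: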

[['22574999', 'your message communication sent']
, ['22582857', 'your message be delivered']
, ['22585166', 'message has be delivered']
, ['22585424', 'message originated communication sent']]

This is what I did:

for row in messages:
    sstrm = list(row)
    bigrams = [b for l in sstrm for b in zip(l.split(" ")[:1], l.split(" ")[1:])]
    print(sstrm[0], bigrams)

Which produces:

22574999 [('your', 'message')]
22582857 [('your', 'message')]
22585166 [('message', 'has')]
22585424 [('message', 'originated')]

What I want is:

22574999 [('your', 'message'), ('communication', 'sent')]
22582857 [('your', 'message'), ('be', 'delivered')]
22585166 [('message', 'has'), ('be', 'delivered')]
22585424 [('message', 'originated'), ('communication', 'sent')]

I want to get the following result:

The top 3 bigrams with the highest frequency:

('your', 'message'): 2
('communication', 'sent'): 2
('be', 'delivered'): 2

The IDs in which the top 3 most frequent bigrams occur:

('your', 'message'): 2, found in (22574999, 22582857)
('communication', 'sent'): 2, found in (22574999, 22585424)
('be', 'delivered'): 2, found in (22582857, 22585166)

Thanks for your help!

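Best answer

The first thing I want to point out is that a bigram is a sequence of two adjacent elements.

For example, the bigrams of "the fox jumped over the lazy dog" are: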

[("the", "fox"), ("fox", "jumped"), ("jumped", "over"), ("over", "the"), ("the", "lazy"), ("lazy", "dog")]
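The list above can be reproduced with a short sketch that pairs each token with its right-hand neighbour. The function name `adjacent_bigrams` is made up for illustration. Note that the attempt in the question slices with `[:1]`, which keeps only the first word, so `zip` yields a single pair per sentence; zipping the full token list against itself shifted by one gives every adjacent pair.

```python
def adjacent_bigrams(sentence):
    """Return every pair of neighbouring words in `sentence`."""
    tokens = sentence.split()
    # zip stops at the shorter input, so token i is paired with token i + 1
    return list(zip(tokens, tokens[1:]))


example = adjacent_bigrams("the fox jumped over the lazy dog")
print(example)
```

A sentence with fewer than two words simply produces an empty list.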

This problem can be modeled with an inverted index, where the bigrams are the terms and the list of IDs for each bigram is its postings list.

import operator


def bigrams(line):
    # Pair every token with its right-hand neighbour
    tokens = line.split(" ")
    return [(tokens[i], tokens[i+1]) for i in range(0, len(tokens)-1)]


if __name__ == "__main__":
    messages = [['22574999', 'your message communication sent'], ['22582857', 'your message be delivered'], ['22585166', 'message has be delivered'], ['22585424', 'message originated communication sent']]
    bigrams_set = set()

    # First pass: collect every distinct bigram
    for row in messages:
        l_bigrams = bigrams(row[1])
        for bigram in l_bigrams:
            bigrams_set.add(bigram)

    # Inverted index: bigram -> list of IDs of the messages containing it
    inverted_idx = dict((b, []) for b in bigrams_set)

    for row in messages:
        l_bigrams = bigrams(row[1])
        for bigram in l_bigrams:
            inverted_idx[bigram].append(row[0])

    # Frequency of a bigram = length of its postings list
    freq_bigrams = dict((b, len(ids)) for b, ids in inverted_idx.items())
    # items() replaces the Python 2-only iteritems()
    top3_bigrams = sorted(freq_bigrams.items(), key=operator.itemgetter(1), reverse=True)[:3]
    print(top3_bigrams)

Output (bigrams tied on frequency may appear in a different order):

[(('communication', 'sent'), 2), (('your', 'message'), 2), (('be', 'delivered'), 2)]

Although this code can be optimized a lot, it gives you the idea.
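As one possible tightening (a sketch, not the answerer's own code), the index can be built in a single pass with `collections.defaultdict`, and the postings lists can be kept so that the IDs for each top bigram are reported as the question asks. The helper names `build_inverted_index` and `top_bigrams` are hypothetical, and the alphabetical tie-break is an assumption added here only to make the order of equally frequent bigrams deterministic.

```python
from collections import defaultdict


def bigrams(line):
    tokens = line.split()
    return list(zip(tokens, tokens[1:]))


def build_inverted_index(messages):
    """Map each bigram to the list of message IDs that contain it."""
    index = defaultdict(list)
    for msg_id, text in messages:
        for bigram in bigrams(text):
            index[bigram].append(msg_id)
    return index


def top_bigrams(index, n=3):
    # Most IDs first; ties broken alphabetically (an assumption, for stable output)
    ranked = sorted(index.items(), key=lambda item: (-len(item[1]), item[0]))
    return ranked[:n]


messages = [
    ['22574999', 'your message communication sent'],
    ['22582857', 'your message be delivered'],
    ['22585166', 'message has be delivered'],
    ['22585424', 'message originated communication sent'],
]

index = build_inverted_index(messages)
top3 = top_bigrams(index)
for bigram, ids in top3:
    print(bigram, len(ids), ids)
```

Like the code above, this counts a bigram once per occurrence, so a bigram repeated inside one message would list that ID more than once; wrapping `bigrams(text)` in `set()` would count each message at most once.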

Regarding "python - Forming bigrams of words in a list of sentences and counting them with Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46566402/
