
Python: How do I compute the cosine similarity of two word lists?

Reposted. Author: 太空狗. Updated: 2023-10-30 02:30:09

I want to compute the cosine similarity of two lists, such as the following:

A = [u'home (private)', u'bank', u'bank', u'building(condo/apartment)','factory']

B = [u'home (private)', u'school', u'bank', u'shopping mall']

I know the cosine similarity of A and B should be

3/(sqrt(7)*sqrt(4)).
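
As a check on that figure, here is a sketch of the arithmetic, treating each list as a vector of counts over the distinct items. The component order (home (private), bank, building(condo/apartment), factory, school, shopping mall) is an arbitrary choice made for illustration.

```latex
\begin{aligned}
\mathbf{a} &= (1, 2, 1, 1, 0, 0), \qquad \mathbf{b} = (1, 1, 0, 0, 1, 1) \\
\mathbf{a} \cdot \mathbf{b} &= 1 \cdot 1 + 2 \cdot 1 = 3 \\
\lVert \mathbf{a} \rVert &= \sqrt{1^2 + 2^2 + 1^2 + 1^2} = \sqrt{7}, \qquad
\lVert \mathbf{b} \rVert = \sqrt{1^2 + 1^2 + 1^2 + 1^2} = \sqrt{4} \\
\cos\theta &= \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
            = \frac{3}{\sqrt{7}\,\sqrt{4}} \approx 0.567
\end{aligned}
```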

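I tried reshaping each list into a form like "home bank bank building factory", which looks like a sentence. However, some elements (such as home (private)) contain spaces themselves, and some contain parentheses, so I find it hard to count word occurrences.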

Do you know how to count the word occurrences in lists like these, so that for list B the counts can be represented as

{'home (private)': 1, 'school': 1, 'bank': 1, 'shopping mall': 1}?

Or do you know how to compute the cosine similarity of these two lists?

Thank you very much.

Best answer

from collections import Counter

# word-lists to compare
a = [u'home (private)', u'bank', u'bank', u'building(condo/apartment)','factory']
b = [u'home (private)', u'school', u'bank', u'shopping mall']

# count word occurrences
a_vals = Counter(a)
b_vals = Counter(b)

# convert to word-vectors (set order is arbitrary, so the vectors shown are one possible ordering)
words = list(a_vals.keys() | b_vals.keys())
a_vect = [a_vals.get(word, 0) for word in words] # e.g. [0, 0, 1, 1, 2, 1]
b_vect = [b_vals.get(word, 0) for word in words] # e.g. [1, 1, 1, 0, 1, 0]

# find cosine
len_a = sum(av*av for av in a_vect) ** 0.5 # sqrt(7)
len_b = sum(bv*bv for bv in b_vect) ** 0.5 # sqrt(4)
dot = sum(av*bv for av,bv in zip(a_vect, b_vect)) # 3
cosine = dot / (len_a * len_b) # 0.5669467
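
The same steps can be wrapped in a reusable function. The following is a minimal sketch, not part of the original answer: the name cosine_similarity is hypothetical, and returning 0.0 when either list is empty is an assumption made here to avoid dividing by zero. It skips building explicit vectors, since only items present in both lists contribute to the dot product.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity of two lists of hashable items, based on item counts.

    Hypothetical helper for illustration; not from the original answer.
    """
    a_counts, b_counts = Counter(a), Counter(b)

    # Items missing from either list have a count of 0 there,
    # so only the shared items contribute to the dot product.
    shared = a_counts.keys() & b_counts.keys()
    dot = sum(a_counts[item] * b_counts[item] for item in shared)

    # Vector lengths are the square roots of the summed squared counts.
    norm_a = sqrt(sum(c * c for c in a_counts.values()))
    norm_b = sqrt(sum(c * c for c in b_counts.values()))

    if norm_a == 0 or norm_b == 0:
        # Assumption: treat similarity with an empty list as 0.0.
        return 0.0
    return dot / (norm_a * norm_b)


a = ['home (private)', 'bank', 'bank', 'building(condo/apartment)', 'factory']
b = ['home (private)', 'school', 'bank', 'shopping mall']
print(round(cosine_similarity(a, b), 7))  # 0.5669467
```

Because each list element is counted as a whole string, spaces and parentheses inside an element such as 'home (private)' need no special handling.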

Regarding "Python: How do I compute the cosine similarity of two word lists?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/28819272/
