
python - Efficiently computing cosine similarity with scikit-learn


After preprocessing and transforming the data (BOW, TF-IDF), I need to compute its cosine similarity with every other element in the dataset. Currently, I do this:

cs_title = [cosine_similarity(a, b) for a in tr_title for b in tr_title]
cs_abstract = [cosine_similarity(a, b) for a in tr_abstract for b in tr_abstract]
cs_mesh = [cosine_similarity(a, b) for a in pre_mesh for b in pre_mesh]
cs_pt = [cosine_similarity(a, b) for a in pre_pt for b in pre_pt]

In this example, each input variable, e.g. tr_title, is a SciPy sparse matrix. However, this code runs very slowly. What can I do to optimize the code so that it runs faster?
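
A minimal sketch of a related shortcut, assuming tr_title and the other variables are the sparse matrices described above: scikit-learn's cosine_similarity also accepts a whole matrix and returns the full pairwise similarity matrix in a single call, avoiding the Python-level double loop:

from sklearn.metrics.pairwise import cosine_similarity

# Passing a single matrix computes all pairwise similarities at once;
# the result is a dense n x n array where entry (i, j) is the cosine
# similarity between rows i and j.
cs_title = cosine_similarity(tr_title)
cs_abstract = cosine_similarity(tr_abstract)
cs_mesh = cosine_similarity(pre_mesh)
cs_pt = cosine_similarity(pre_pt)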

Best Answer

To improve performance, you should replace the list comprehensions with vectorized code. This can be done easily with SciPy's pdist and squareform, as shown in the following snippet:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import pdist, squareform

titles = [
'A New Hope',
'The Empire Strikes Back',
'Return of the Jedi',
'The Phantom Menace',
'Attack of the Clones',
'Revenge of the Sith',
'The Force Awakens',
'A Star Wars Story',
'The Last Jedi',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(titles)
cs_title = squareform(pdist(X.toarray(), 'cosine'))

Demo:

In [87]: X
Out[87]:
<9x21 sparse matrix of type '<type 'numpy.int64'>'
with 30 stored elements in Compressed Sparse Row format>

In [88]: X.toarray()
Out[88]:
array([[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
[1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0],
[0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]], dtype=int64)

In [89]: vectorizer.get_feature_names()
Out[89]:
[u'attack',
u'awakens',
u'back',
u'clones',
u'empire',
u'force',
u'hope',
u'jedi',
u'last',
u'menace',
u'new',
u'of',
u'phantom',
u'return',
u'revenge',
u'sith',
u'star',
u'story',
u'strikes',
u'the',
u'wars']

In [90]: np.set_printoptions(precision=2)

In [91]: print(cs_title)
[[ 0. 1. 1. 1. 1. 1. 1. 1. 1. ]
[ 1. 0. 0.75 0.71 0.75 0.75 0.71 1. 0.71]
[ 1. 0.75 0. 0.71 0.5 0.5 0.71 1. 0.42]
[ 1. 0.71 0.71 0. 0.71 0.71 0.67 1. 0.67]
[ 1. 0.75 0.5 0.71 0. 0.5 0.71 1. 0.71]
[ 1. 0.75 0.5 0.71 0.5 0. 0.71 1. 0.71]
[ 1. 0.71 0.71 0.67 0.71 0.71 0. 1. 0.67]
[ 1. 1. 1. 1. 1. 1. 1. 0. 1. ]
[ 1. 0.71 0.42 0.67 0.71 0.71 0.67 1. 0. ]]

Note that X.toarray().shape yields (9L, 21L) because the toy example above contains 9 titles and 21 distinct words, whereas cs_title is a 9 by 9 array.
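
One caveat worth keeping in mind: pdist with the 'cosine' metric returns cosine distances rather than similarities, which is why the diagonal of cs_title above is 0. A minimal sketch of the conversion, reusing X from the snippet above (dist_title and sim_title are illustrative names; pdist needs a dense array, hence the toarray() call):

from scipy.spatial.distance import pdist, squareform

# pdist returns condensed cosine *distances* (1 - similarity);
# squareform expands them into a full square matrix.
dist_title = squareform(pdist(X.toarray(), 'cosine'))
sim_title = 1 - dist_title  # diagonal becomes 1, as expected for similarities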

Regarding python - Efficiently computing cosine similarity with scikit-learn, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42044770/
