
python - What is the operation behind the word analogies in Word2vec?

Reposted · Author: 太空宇宙 · Updated: 2023-11-04 00:11:37

According to https://code.google.com/archive/p/word2vec/:

It was recently shown that the word vectors capture many linguistic regularities, for example vector operations vector('Paris') - vector('France') + vector('Italy') results in a vector that is very close to vector('Rome'), and vector('king') - vector('man') + vector('woman') is close to vector('queen') [3, 1]. You can try out a simple demo by running demo-analogy.sh.

So we can try it with the provided demo script:

+ ../bin/word-analogy ../data/text8-vector.bin
Enter three words (EXIT to break): paris france berlin

Word: paris Position in vocabulary: 198365

Word: france Position in vocabulary: 225534

Word: berlin Position in vocabulary: 380477

Word Distance
------------------------------------------------------------------------
germany 0.509434
european 0.486505

Note that paris france berlin is the input suggested by the demo. The problem is that if I open the same word vectors in Gensim and try to compute the vector myself, I cannot reproduce this behavior. For example:

>>> from gensim.models import KeyedVectors
>>> import numpy as np
>>> word_vectors = KeyedVectors.load_word2vec_format(BIGDATA, binary=True)
>>> v = word_vectors['paris'] - word_vectors['france'] + word_vectors['berlin']
>>> word_vectors.most_similar(np.array([v]))
[('berlin', 0.7331711649894714), ('paris', 0.6669869422912598), ('kunst', 0.4056406617164612), ('inca', 0.4025722146034241), ('dubai', 0.3934606909751892), ('natalie_portman', 0.3909246325492859), ('joel', 0.3843030333518982), ('lil_kim', 0.3784593939781189), ('heidi', 0.3782389461994171), ('diy', 0.3767407238483429)]

So what exactly is word-analogy doing, and how should I replicate it?

Best Answer

It should just be element-wise addition and subtraction of vectors, with cosine similarity used to find the nearest word. Two details explain the mismatch, though. First, the demo computes vec(word2) - vec(word1) + vec(word3) on unit-normalized vectors, i.e. france - paris + berlin, not paris - france + berlin. Second, the demo skips the three input words when ranking results, which is why berlin and paris dominate your raw most_similar output. Also, if you use the original word2vec embeddings, "paris" and "Paris" are different entries (the strings are not lowercased or lemmatized).
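The demo's computation can be sketched in plain numpy. The toy 2-D vectors below are hypothetical values chosen so the city→country offset is exact; the real tool does the same steps over the loaded text8 vectors:

```python
import numpy as np

# Toy stand-ins for the real embeddings (hypothetical values).
vocab = {
    "paris":   np.array([1.0, 0.0]),
    "france":  np.array([1.0, 1.0]),
    "berlin":  np.array([3.0, 0.0]),
    "germany": np.array([3.0, 1.0]),
    "rome":    np.array([5.0, 0.0]),
}

def analogy(w1, w2, w3, vocab):
    """Mimic word-analogy: unit-normalize every vector, form
    v(w2) - v(w1) + v(w3), rank by cosine, skipping the input words."""
    unit = {w: v / np.linalg.norm(v) for w, v in vocab.items()}
    target = unit[w2] - unit[w1] + unit[w3]
    target = target / np.linalg.norm(target)
    scores = {
        w: float(np.dot(target, v))
        for w, v in unit.items()
        if w not in (w1, w2, w3)  # the demo never returns its own inputs
    }
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(analogy("paris", "france", "berlin", vocab))
# 'germany' ranks first once the inputs are excluded
```

With the three input words excluded from the ranking, germany comes out on top, mirroring the demo output.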

You can also try:

v = word_vectors['France'] - word_vectors['Paris'] + word_vectors['Berlin']

v = word_vectors['Paris'] - word_vectors['France'] + word_vectors['Germany']

since you should compare like concepts (city - country + country -> another city).

Regarding python - What is the operation behind the word analogies in Word2vec?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/52364632/
