
python - How does the Gensim FastText pre-trained model get vectors for out-of-vocabulary words?


I am using gensim to load a pre-trained fastText model. I downloaded the English Wikipedia trained model from the fasttext website.

Here is the code I wrote to load the pre-trained model:

from gensim.models import FastText as ft
model = ft.load_fasttext_format("wiki.en.bin")
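
As an aside, newer Gensim releases (3.8 and later) deprecate load_fasttext_format in favor of load_facebook_model, and Gensim 4.x removes it entirely; a minimal sketch of the equivalent load, assuming the same wiki.en.bin file:

from gensim.models.fasttext import load_facebook_model

# Equivalent load on Gensim 3.8+; parses the same Facebook-format .bin file.
model = load_facebook_model("wiki.en.bin")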

I tried to check whether the phrase below is present in the vocabulary (which is unlikely, since this is a pre-trained model).

print("internal executive" in model.wv.vocab)
print("internal executive" in model.wv)

False
True

So the phrase "internal executive" is not present in the vocabulary, yet we still get a word vector for it.

model.wv["internal executive"]
Out[46]:
array([ 0.0210917 , -0.15233646, -0.1173932 , -0.06210957, -0.07288644,
-0.06304111, 0.07833624, -0.17026938, -0.21922196, 0.01146349,
-0.13639058, 0.17283678, -0.09251394, -0.17875175, 0.01339212,
-0.26683623, 0.05487974, -0.11843193, -0.01982722, 0.37037706,
-0.24370994, 0.14269598, -0.16363597, 0.00328478, -0.16560239,
-0.1450972 , -0.24787527, -0.01318423, 0.03277111, 0.16175713,
-0.19367714, 0.16955379, 0.1972683 , 0.09044111, 0.01731548,
-0.0034324 , -0.04834719, 0.14321515, 0.01422525, -0.08803893,
-0.29411593, -0.1033244 , 0.06278021, 0.16452256, 0.0650492 ,
0.1506474 , -0.14194389, 0.10778475, 0.16008648, -0.07853138,
0.2183501 , -0.25451994, -0.0345991 , -0.28843886, 0.19964759,
-0.10923116, 0.26665714, -0.02544454, 0.30637854, 0.04568949,
-0.04798719, -0.05769338, 0.25762403, -0.05158515, -0.04426906,
-0.19901046, 0.00894193, -0.17269588, -0.24747233, -0.19061406,
0.14322804, -0.10804397, 0.4002605 , 0.01409482, -0.04675362,
0.10039093, 0.07260711, -0.0938239 , -0.20434211, 0.05741301,
0.07592541, -0.02921724, 0.21137556, -0.23188967, -0.23164661,
-0.4569614 , 0.07434579, 0.10841205, -0.06514647, 0.01220404,
0.02679767, 0.11840229, 0.2247431 , -0.1946325 , -0.0990666 ,
-0.02524677, 0.0801085 , 0.02437297, 0.00674876, 0.02088535,
0.21464555, -0.16240154, 0.20670174, -0.21640894, 0.03900698,
0.21772243, 0.01954809, 0.04541844, 0.18990673, 0.11806394,
-0.21336791, -0.10871669, -0.02197789, -0.13249406, -0.20440844,
0.1967368 , 0.09804545, 0.1440366 , -0.08401451, -0.03715726,
0.27826542, -0.25195453, -0.16737154, 0.3561183 , -0.15756823,
0.06724873, -0.295487 , 0.28395334, -0.04908851, 0.09448399,
0.10877471, -0.05020981, -0.24595442, -0.02822314, 0.17862654,
0.06452435, -0.15105674, -0.31911567, 0.08166212, 0.2634299 ,
0.17043628, 0.10063848, 0.0687021 , -0.12210461, 0.10803893,
0.13644943, 0.10755012, -0.09816817, 0.11873955, -0.03881042,
0.18548298, -0.04769253, -0.01511982, -0.08552645, -0.05218676,
0.05387992, 0.0497043 , 0.06922272, -0.0089245 , 0.24790663,
0.27209425, -0.04925154, -0.08621719, 0.15918174, 0.25831223,
0.01654229, -0.03617229, -0.13490392, 0.08033483, 0.34922174,
-0.01744722, -0.16894792, -0.10506647, 0.21708378, -0.22582002,
0.15625793, -0.10860757, -0.06058934, -0.25798836, -0.20142137,
-0.06613475, -0.08779443, -0.10732629, 0.05967236, -0.02455976,
0.2229451 , -0.19476262, -0.2720119 , 0.03687386, -0.01220259,
0.07704347, -0.1674307 , 0.2400516 , 0.07338555, -0.2000631 ,
0.13897157, -0.04637206, -0.00874449, -0.32827383, -0.03435039,
0.41587186, 0.04643605, 0.03352945, -0.13700874, 0.16430037,
-0.13630766, -0.18546128, -0.04692861, 0.37308362, -0.30846512,
0.5535561 , -0.11573419, 0.2332801 , -0.07236694, -0.01018955,
0.05936847, 0.25877884, -0.2959846 , -0.13610311, 0.10905041,
-0.18220575, 0.06902339, -0.10624941, 0.33002165, -0.12087796,
0.06742091, 0.20762768, -0.34141317, 0.0884434 , 0.11247049,
0.14748637, 0.13261876, -0.07357208, -0.11968047, -0.22124515,
0.12290633, 0.16602683, 0.01055585, 0.04445777, -0.11142147,
0.00004863, 0.22543314, -0.14342701, -0.23209116, -0.00003538,
0.19272381, -0.13767233, 0.04850799, -0.281997 , 0.10343244,
0.16510887, 0.08671653, -0.24125539, 0.01201926, 0.0995285 ,
0.09807415, -0.06764816, -0.0206733 , 0.04697794, 0.02000999,
0.05817033, 0.10478792, 0.0974884 , -0.01756372, -0.2466861 ,
0.02877498, 0.02499748, -0.00370895, -0.04728201, 0.00107118,
-0.21848503, 0.2033032 , -0.00076264, 0.03828803, -0.2929495 ,
-0.18218371, 0.00628893, 0.20586628, 0.2410889 , 0.02364616,
-0.05220835, -0.07040054, -0.03744286, -0.06718048, 0.19264086,
-0.06490505, 0.27364203, 0.05527219, -0.27494466, 0.22256687,
0.10330909, -0.3076979 , 0.04852265, 0.07411488, 0.23980476,
0.1590279 , -0.26712465, 0.07580928, 0.05644221, -0.18824042], dtype=float32)

Now my confusion is that fastText also creates vectors for the character n-grams of a word. So for the word "internal" it will create vectors for all of its character n-grams, including the whole word itself, and the final word vector is the sum of its character n-gram vectors.
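
For concreteness, here is a minimal sketch of that n-gram extraction; char_ngrams is a hypothetical helper, not part of Gensim, and it assumes fastText's < and > boundary markers and the min_n=3, max_n=6 defaults used by the Wikipedia models:

def char_ngrams(word, min_n=3, max_n=6):
    # fastText pads the word with boundary markers before slicing
    wrapped = '<' + word + '>'
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams

print(char_ngrams('internal')[:5])
# ['<in', 'int', 'nte', 'ter', 'ern']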

But then how can it still give me a vector for a phrase, or even an entire sentence? Aren't fastText vectors meant for words and their n-grams? So what are these vectors I am seeing, when the input is clearly two words?

Best Answer

From the paper Enriching Word Vectors with Subword Information:

Continuous word representations, trained on large unlabeled corpora are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated to each character n-gram; words being represented as the sum of these representations.

So, out-of-vocabulary words are represented as the sum of their character n-gram vectors. While the intent is to handle out-of-vocabulary words (unks) like "blargfizzle", it also handles a phrase like your input.
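
To see why a two-word phrase still works, note that the whole string, space included, is sliced into n-grams, so some n-grams straddle the space; a quick check with the hypothetical char_ngrams helper sketched above:

# n-grams of the phrase that span the space between the two words
print([g for g in char_ngrams('internal executive') if ' ' in g][:4])
# ['al ', 'l e', ' ex', 'nal ']

The remaining n-grams ('int', 'exec', and so on) were seen during training, which is why a meaningful vector comes back.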

If you look at the implementation of the vectors in Gensim, you can see that this is indeed what it is doing (along with normalization and hashing, etc.); I have added some comments starting with XXX:

def word_vec(self, word, use_norm=False):
    """
    Accept a single word as input.
    Returns the word's representations in vector space, as a 1D numpy array.
    If `use_norm` is True, returns the normalized word vector.
    """
    if word in self.vocab:
        # XXX in-vocab terms return with a simple lookup
        return super(FastTextKeyedVectors, self).word_vec(word, use_norm)
    else:
        # from gensim.models.fasttext import compute_ngrams
        # XXX Initialize the vector for the unk
        word_vec = np.zeros(self.vectors_ngrams.shape[1], dtype=np.float32)
        ngrams = _compute_ngrams(word, self.min_n, self.max_n)
        if use_norm:
            ngram_weights = self.vectors_ngrams_norm
        else:
            ngram_weights = self.vectors_ngrams
        ngrams_found = 0
        for ngram in ngrams:
            ngram_hash = _ft_hash(ngram) % self.bucket
            if ngram_hash in self.hash2index:
                # XXX add the vector for the ngram to the unk vector
                word_vec += ngram_weights[self.hash2index[ngram_hash]]
                ngrams_found += 1
        if word_vec.any():
            return word_vec / max(1, ngrams_found)
        else:  # No ngrams of the word are present in self.ngrams
            raise KeyError('all ngrams for word %s absent from model' % word)
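
A usage sketch against the model loaded earlier, assuming the Gensim 3.x API shown above:

# "internal executive" is not in self.vocab, so this takes the else branch
# above and averages the n-gram vectors that were found.
vec = model.wv.word_vec("internal executive")
print(vec.shape)  # (300,) for the wiki.en vectors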

Note that this does not mean it can provide vectors for arbitrary strings: it still needs to have n-gram data for at least some of the unk's n-grams, so a string like xwkxwkzrw or 天爾遠波 will probably not return anything if your vectors were trained on English.
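
A quick demonstration of that failure mode, assuming the English wiki.en model from the question:

# None of this string's n-grams are likely to appear in English training
# data, so the else branch above should raise KeyError.
try:
    model.wv["天爾遠波"]
except KeyError as err:
    print(err)  # all ngrams for word ... absent from model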

Regarding "python - How does the Gensim FastText pre-trained model get vectors for out-of-vocabulary words?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50828314/
