
nlp - What are the random aspects of Word2Vec?


I am using Gensim to vectorize the words in several different corpora, and the results are making me rethink what Word2Vec actually does. My understanding was that Word2Vec is deterministic, and that a word's position in vector space would not change from one training run to the next. If "My cat is running" and "your dog can't be running" are the two sentences in the corpus, then the value of "running" (or its stem) seems as though it would necessarily be fixed.

However, I have found that the value does vary across models, and words keep shifting position in vector space each time I train a model. The differences are not always hugely meaningful, but they do indicate that some random process is involved. What am I missing here?
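A minimal sketch of the kind of comparison I mean (gensim 4.x API, with a hypothetical two-sentence toy corpus rather than my actual data):

# Train once and inspect the vector for "running"; re-running this script
# in a fresh Python interpreter typically prints different numbers, even
# though the corpus and the parameters are unchanged.
from gensim.models import Word2Vec

sentences = [
    ["my", "cat", "is", "running"],
    ["your", "dog", "can't", "be", "running"],
]

model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=50)
print(model.wv["running"][:5])  # first few dimensions only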

Best Answer

This is covered in detail in the Gensim FAQ, which I quote here (a short code sketch of the workaround it describes follows the quote):

Q11: I've trained my Word2Vec/Doc2Vec/etc model repeatedly using the exact same text corpus, but the vectors are different each time. Is there a bug or have I made a mistake? (*2vec training non-determinism)

Answer: The *2vec models (word2vec, fasttext, doc2vec…) begin with random initialization, then most modes use additional randomization during training. (For example, the training windows are randomly truncated as an efficient way of weighting nearer words higher. The negative examples in the default negative-sampling mode are chosen randomly. And the downsampling of highly-frequent words, as controlled by the sample parameter, is driven by random choices. These behaviors were all defined in the original Word2Vec paper's algorithm description.)

Even when all this randomness comes from a pseudorandom-number-generator that's been seeded to give a reproducible stream of random numbers (which gensim does by default), the usual case of multi-threaded training can further change the exact training-order of text examples, and thus the final model state. (Further, in Python 3.x, the hashing of strings is randomized each re-launch of the Python interpreter - changing the iteration ordering of vocabulary dicts from run to run, and thus making even the same string-of-random-number-draws pick different words in different launches.)

So, it is to be expected that models vary from run to run, even trained on the same data. There's no single "right place" for any word-vector or doc-vector to wind up: just positions that are at progressively more-useful distances & directions from other vectors co-trained inside the same model. (In general, only vectors that were trained together in an interleaved session of contrasting uses become comparable in their coordinates.)

Suitable training parameters should yield models that are roughly as useful, from run-to-run, as each other. Testing and evaluation processes should be tolerant of any shifts in vector positions, and of small "jitter" in the overall utility of models, that arises from the inherent algorithm randomness. (If the observed quality from run-to-run varies a lot, there may be other problems: too little data, poorly-tuned parameters, or errors/weaknesses in the evaluation method.)

You can try to force determinism, by using workers=1 to limit training to a single thread – and, if in Python 3.x, using the PYTHONHASHSEED environment variable to disable its usual string hash randomization. But training will be much slower than with more threads. And, you'd be obscuring the inherent randomness/approximateness of the underlying algorithms, in a way that might make results more fragile and dependent on the luck of a particular setup. It's better to tolerate a little jitter, and use excessive jitter as an indicator of problems elsewhere in the data or model setup – rather than impose a superficial determinism.
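To make that workaround concrete, here is a sketch using the gensim 4.x API; the corpus and parameter values are illustrative only, not taken from the question:

# Determinism workaround described in the FAQ (gensim 4.x).
# PYTHONHASHSEED must be set before the interpreter starts, e.g.:
#   PYTHONHASHSEED=0 python train.py
# (setting it inside an already-running script does not affect string hashing).
from gensim.models import Word2Vec

sentences = [
    ["my", "cat", "is", "running"],
    ["your", "dog", "can't", "be", "running"],
]

model = Word2Vec(
    sentences,
    vector_size=100,
    window=5,        # training windows are randomly truncated up to this size
    min_count=1,
    negative=5,      # negative-sampling draws are pseudorandom
    sample=1e-3,     # downsampling of frequent words is pseudorandom
    seed=42,         # seeds gensim's pseudorandom number generator
    workers=1,       # single thread: fixed ordering of training examples
    epochs=5,
)

With workers=1, a fixed seed, and PYTHONHASHSEED pinned, repeated runs should reproduce the same vectors, at the cost of much slower training. As the FAQ notes, it is usually better to tolerate a little jitter and treat large run-to-run swings as a sign of problems in the data or parameters.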

Regarding "nlp - What are the random aspects of Word2Vec?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54165109/
