
python - Why am I getting a single-letter vocabulary in Gensim word2vec?

Reposted. Author: 太空狗. Updated: 2023-10-30 01:11:55

I am building a word2vec model as shown below.

from gensim.models import word2vec, Phrases
documents = ["the mayor of new york was there", "human computer interaction and machine learning has now become a trending research area","human computer interaction is interesting","human computer interaction is a pretty interesting subject", "human computer interaction is a great and new subject", "machine learning can be useful sometimes","new york mayor was present", "I love machine learning because it is a new subject area", "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]

bigram = Phrases(sentence_stream, min_count=1, delimiter=b' ')
trigram = Phrases(bigram[sentence_stream], min_count=1, delimiter=b' ')

for sent in sentence_stream:
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigram[sent]]

    print(bigrams_)
    print(trigrams_)


# Set values for various parameters
num_features = 10 # Word vector dimensionality
min_word_count = 1 # Minimum word count
num_workers = 4 # Number of threads to run in parallel
context = 5 # Context window size
downsampling = 1e-3 # Downsample setting for frequent words


model = word2vec.Word2Vec(trigrams_, workers=num_workers,
                          size=num_features, min_count=min_word_count,
                          window=context, sample=downsampling)

vocab = list(model.wv.vocab.keys())
print(vocab[:10])

However, the model's vocabulary comes out as single characters, as shown below.

['h', 'u', 'm', 'a', 'n', ' ', 'c', 'o', 'p', 't']

The bigrams and trigrams are produced correctly, so I am confused about where the code goes wrong. What is the problem?

Best answer

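This solved my problem. In the question's code, trigrams_ holds only the tokens of the last sentence, which is a flat list of strings, so Word2Vec treats each string as a sentence and each character as a word. I should pass a list of lists (one token list per sentence) to the word2vec model, as follows.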

trigram_sentences_project = []


bigram = Phrases(sentence_stream, min_count=1, delimiter=b' ')
trigram = Phrases(bigram[sentence_stream], min_count=1, delimiter=b' ')


for sent in sentence_stream:
    #bigrams_ = [b for b in bigram[sent] if b.count(' ') == 1]
    #trigrams_ = [t for t in trigram[bigram[sent]] if t.count(' ') == 2]
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigram[sent]]
    trigram_sentences_project.append(trigrams_)
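
To see why the shape of the input matters, here is a minimal sketch in plain Python that does not use Gensim. The helper scan_vocab is hypothetical: it only imitates how a sentence-based trainer walks its input (each item is a sentence, each element of that item is a token). The token list last_sentence is an assumed phrase output for the final document, not something copied from a real run.

```python
from collections import Counter


def scan_vocab(sentences):
    """Hypothetical stand-in for a vocabulary scan: every item of
    `sentences` is one sentence, and every element produced by
    iterating over that item is counted as one token."""
    counts = Counter()
    for sentence in sentences:
        for token in sentence:
            counts[token] += 1
    return counts


# Assumed tokens of the LAST sentence after phrase detection.
# This is what `trigrams_` refers to once the loop in the question ends.
last_sentence = ["human computer interaction", "helps", "people", "to",
                 "get", "user", "friendly", "applications"]

# Flat list of strings: each string is iterated character by character.
flat_vocab = list(scan_vocab(last_sentence))

# List of lists: each inner list is one sentence of whole tokens.
nested_vocab = list(scan_vocab([last_sentence]))

print(flat_vocab[:10])   # single characters, as in the question
print(nested_vocab[:3])  # whole tokens and phrases
```

With the flat list, the first ten keys are the single characters reported in the question. With the nested list, the keys are whole tokens. Under the same reasoning, passing trigram_sentences_project instead of trigrams_ to word2vec.Word2Vec, with the other arguments unchanged, should give a vocabulary of words and phrases.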

For more on python - Why am I getting a single-letter vocabulary in Gensim word2vec?, see the similar question we found on Stack Overflow: https://stackoverflow.com/questions/46441876/
