
python - Printing the bigrams learned with gensim

Reposted. Author: 行者123. Last updated: 2023-12-01 08:36:35

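I want to use gensim to learn bigrams from a corpus and then print the bigrams it learned. I have not seen an example of how to do this. Any help is appreciated.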

from gensim.models import Phrases
documents = [
    "the mayor of new york was there",
    "human computer interaction and machine learning has now become a trending research area",
    "human computer interaction is interesting",
    "human computer interaction is a pretty interesting subject",
    "human computer interaction is a great and new subject",
    "machine learning can be useful sometimes",
    "new york mayor was present",
    "I love machine learning because it is a new subject area",
    "human computer interaction helps people to get user friendly applications",
]
sentence_stream = [doc.split(" ") for doc in documents]

bigram = Phrases(sentence_stream)

# How can I print all the bigrams learned, and only the bigrams, including "new_york" and "human computer"?

Best answer

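The OP's answer works if you train the model with the Phrases class mentioned above and print the bigrams without persisting the model. It stops working once you save the model and load it again later. To load a saved model, you need to use the Phraser class, as follows: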

from gensim.models.phrases import Phraser

Then load the model:

bigram_model = Phraser.load('../../whatever_bigram_model')

If you then apply the approach from the OP's answer to the loaded model, that is:

OP's answer

import operator

# Keep only vocab entries that look like bigrams and meet min_count, most frequent first
sorted(
    {k: v for k, v in bigram_model.vocab.items() if b'_' in k and v >= bigram_model.min_count}.items(),
    key=operator.itemgetter(1),
    reverse=True)

you will get an error message:

AttributeError: 'Phraser' object has no attribute 'vocab'

Solution

The way to solve this is the following code:

for bigram in bigram_model.phrasegrams.keys():
    print(bigram)

Output:

(b'word1', b'word2')
(b'word3', b'word4')
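
The keys printed above are tuples of byte strings, which is less readable than the `new_york` form the question asks for. Here is a minimal sketch of a helper that turns such keys into plain strings. It assumes keys are either tuples of bytes, as in the output above, or already-joined str values, which is how I understand newer gensim releases store them. The name `phrase_to_text` and the sample dictionary are made up for illustration. The dictionary only stands in for `bigram_model.phrasegrams`, and its values are placeholders.

```python
def phrase_to_text(key, delimiter="_"):
    """Return a readable 'word1_word2' string for a phrasegrams-style key."""
    if isinstance(key, (tuple, list)):
        # Tuple keys: decode each part if it is bytes, then join with the delimiter.
        parts = [p.decode("utf-8") if isinstance(p, bytes) else p for p in key]
        return delimiter.join(parts)
    if isinstance(key, bytes):
        return key.decode("utf-8")
    return key  # already a str


# Hypothetical stand-in for bigram_model.phrasegrams (values are placeholders).
sample_phrasegrams = {
    (b"new", b"york"): (3, 12.5),
    (b"machine", b"learning"): (4, 10.1),
    "human_computer": 9.7,
}

readable = [phrase_to_text(k) for k in sample_phrasegrams]
for phrase in readable:
    print(phrase)
```

With a real model, the same loop would run over `bigram_model.phrasegrams` instead of the sample dictionary.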

This solution works in both cases, for persisted and non-persisted models. For the example given by the OP, a modified version of my solution is:

for ngrams, _ in bigram.vocab.items():
    # vocab keys are bytes here, so decode them before checking for the delimiter
    unicode_ngrams = ngrams.decode('utf-8')
    if '_' in unicode_ngrams:
        print(unicode_ngrams)

Which gives:

the_mayor
mayor_of
of_new
new_york
york_was
was_there
human_computer
computer_interaction
interaction_and
and_machine
machine_learning
learning_has
has_now
now_become

There is more in the output, but I truncated it to keep this answer short.
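
Note that the listing above contains every adjacent word pair that was counted, such as the_mayor and of_new, not only pairs that score highly as phrases. The pure-Python sketch below shows one way to narrow such counts down. It is modelled on the default scoring formula as I understand it from the gensim documentation, (pair_count - min_count) / count_a / count_b * vocabulary_size, and it does not call gensim itself. The function names, the toy sentences, and the `min_count` and `threshold` values are all made up for illustration, and the sketch assumes individual words contain no underscore.

```python
from collections import Counter


def count_vocab(sentences, delimiter="_"):
    """Count single words and adjacent word pairs (a rough stand-in for a vocab)."""
    vocab = Counter()
    for words in sentences:
        vocab.update(words)
        vocab.update(delimiter.join(pair) for pair in zip(words, words[1:]))
    return vocab


def learned_bigrams(vocab, min_count, threshold, delimiter="_"):
    """Keep only the pairs whose score is above the threshold."""
    len_vocab = len(vocab)
    kept = {}
    for key, pair_count in vocab.items():
        if delimiter not in key:
            continue  # a single word, not a pair
        word_a, word_b = key.split(delimiter, 1)
        score = (pair_count - min_count) / vocab[word_a] / vocab[word_b] * len_vocab
        if score > threshold:
            kept[key] = score
    return kept


# Toy data and parameters, chosen only to make the example small.
toy_sentences = [
    ["new", "york", "mayor"],
    ["new", "york", "city"],
    ["the", "mayor", "spoke"],
]
toy_vocab = count_vocab(toy_sentences)
kept = learned_bigrams(toy_vocab, min_count=1, threshold=2.0)
for phrase, score in kept.items():
    print(phrase, score)
```

In this toy run only new_york survives, because it is the only pair seen more often than `min_count`. On a real model, comparing this kind of filtered list against the full vocab listing is a quick way to see which pairs are mere candidates.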

I hope my answer helps clarify things.

For more on python - printing the bigrams learned with gensim, see the similar question we found on Stack Overflow: https://stackoverflow.com/questions/53694381/
