python - Document topic distribution in Gensim LDA

Reposted. Author: 太空狗. Updated: 2023-10-29 16:56:37

I derived an LDA topic model from a toy corpus as follows:

from gensim import corpora
from gensim.models import LdaModel

documents = ['Human machine interface for lab abc computer applications',
             'A survey of user opinion of computer system response time',
             'The EPS user interface management system',
             'System and human system engineering testing of EPS',
             'Relation of user perceived response time to error measurement',
             'The generation of random binary unordered trees',
             'The intersection graph of paths in trees',
             'Graph minors IV Widths of trees and well quasi ordering',
             'Graph minors A survey']

# Lowercase each document and split it on whitespace
texts = [[word for word in document.lower().split()] for document in documents]
dictionary = corpora.Dictionary(texts)

# Bag-of-words corpus used to train the models below
corpus = [dictionary.doc2bow(text) for text in texts]

# Map each token id back to its word
id2word = {}
for word in dictionary.token2id:
    id2word[dictionary.token2id[word]] = word

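I found that when I train the model with a small number of topics, Gensim reports the complete topic distribution over all latent topics for a test document. For example: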

test_lda = LdaModel(corpus,num_topics=5, id2word=id2word)
test_lda[dictionary.doc2bow('human system')]

Out[314]: [(0, 0.59751626959781134),
(1, 0.10001902477790173),
(2, 0.10001375856907335),
(3, 0.10005453508763221),
(4, 0.10239641196758137)]

But when I use a large number of topics, the report is no longer complete:

test_lda = LdaModel(corpus,num_topics=100, id2word=id2word)

test_lda[dictionary.doc2bow('human system')]
Out[315]: [(73, 0.50499999999997613)]

It seems to me that topics with a probability below some threshold (0.01, from what I observed) are omitted from the output.

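I wonder whether this behaviour is due to some aesthetic consideration. And how can I get the distribution of the residual probability mass over all the other topics?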

Thank you for your kind answers!

Best answer

Reading the source shows that topics with a probability smaller than a threshold are ignored. The default value of this threshold is 0.01.

Regarding "python - Document topic distribution in Gensim LDA", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/17310933/
