
text-mining - Probabilities returned by gensim's get_document_topics method do not sum to 1


Sometimes it returns probabilities for all topics and everything is fine, but sometimes it returns probabilities for only a few topics, and they do not add up to one; it seems to depend on the document. In general, when it returns only a few topics, their probabilities sum to roughly 80%. Is it returning only the most relevant topics? Is there a way to force it to return all probabilities?

Maybe I'm missing something, but I can't find any documentation for the method's parameters.

Best answer

I ran into the same problem and solved it by passing the argument minimum_probability=0 when calling the get_document_topics method of the gensim.models.ldamodel.LdaModel object.

    topic_assignments = lda.get_document_topics(corpus, minimum_probability=0)

By default, gensim does not report probabilities below 0.01. So, for any document where some topics are assigned probabilities below this threshold, the topic probabilities for that document will not sum to 1.

Here is an example:

from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel

# Create a corpus from a list of texts
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]

# Train the model on the corpus.
lda = LdaModel(common_corpus, num_topics=100)

# Try values of minimum_probability argument of None (default) and 0
for minimum_probability in (None, 0):
    # Get topic probabilities for each document
    topic_assignments = lda.get_document_topics(common_corpus, minimum_probability=minimum_probability)
    probabilities = [[entry[1] for entry in doc] for doc in topic_assignments]
    # Print output
    print(f"Calculating topic probabilities with minimum_probability argument = {str(minimum_probability)}")
    print("Sum of probabilities:")
    for i, P in enumerate(probabilities):
        sum_P = sum(P)
        print(f"\tdoc {i} = {sum_P}")

The output will be:
Calculating topic probabilities with minimum_probability argument = None
Sum of probabilities:
    doc 0 = 0.6733324527740479
    doc 1 = 0.8585712909698486
    doc 2 = 0.7549994885921478
    doc 3 = 0.8019999265670776
    doc 4 = 0.7524996995925903
    doc 5 = 0
    doc 6 = 0
    doc 7 = 0
    doc 8 = 0.5049992203712463
Calculating topic probabilities with minimum_probability argument = 0
Sum of probabilities:
    doc 0 = 1.0000000400468707
    doc 1 = 1.0000000337604433
    doc 2 = 1.0000000079162419
    doc 3 = 1.0000000284053385
    doc 4 = 0.9999999937135726
    doc 5 = 0.9999999776482582
    doc 6 = 0.9999999776482582
    doc 7 = 0.9999999776482582
    doc 8 = 0.9999999930150807

This default behavior is not clearly stated in the documentation. The default value of the minimum_probability argument of the get_document_topics method is None, but that does not set the threshold to zero. Instead, minimum_probability falls back to the minimum_probability attribute of the gensim.models.ldamodel.LdaModel object, which defaults to 0.01, as shown in the source code:

def __init__(self, corpus=None, num_topics=100, id2word=None,
             distributed=False, chunksize=2000, passes=1, update_every=1,
             alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10,
             iterations=50, gamma_threshold=0.001, minimum_probability=0.01,
             random_state=None, ns_conf=None, minimum_phi_value=0.01,
             per_word_topics=False, callbacks=None, dtype=np.float32):
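
If you would rather not pass minimum_probability=0 on every call, one option is to set the threshold when constructing the model, since get_document_topics falls back to the model's own minimum_probability when its argument is None. The snippet below is a minimal sketch built on the same common_texts toy corpus (it is not part of the original answer); note that gensim clamps the threshold to a tiny positive value internally, so the sums come out extremely close to, rather than exactly, 1:

from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel

common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]

# Setting minimum_probability on the model itself means that calls to
# get_document_topics with the default minimum_probability=None fall back
# to this value and return (almost) the full topic distribution per document.
lda = LdaModel(common_corpus, num_topics=100, minimum_probability=0)

for i, doc_topics in enumerate(lda.get_document_topics(common_corpus)):
    print(f"doc {i}: sum = {sum(prob for _, prob in doc_topics)}")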

For this question (text-mining: probabilities returned by gensim's get_document_topics method do not sum to 1), there is a similar question on Stack Overflow: https://stackoverflow.com/questions/44571617/
