
python - LDA model generates different topics every time I train on the same corpus

Reposted. Author: 太空狗. Updated: 2023-10-29 17:22:47

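I am using Python gensim to train a Latent Dirichlet Allocation (LDA) model on a small corpus of 231 sentences. However, each time I repeat the process, it produces different topics.

Why do the same LDA parameters and corpus generate different topics every time?

And how do I stabilize the topic generation?

I am using this corpus ( http://pastebin.com/WptkKVF0 ) and this stopword list ( http://pastebin.com/LL7dqLcj ), and here is my code: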

from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from collections import defaultdict
import codecs, os, glob, math

# Load the stopword list, skipping blank lines and "#" comment lines.
stopwords = [i.strip() for i in codecs.open('stopmild','r','utf8').readlines() if i.strip() and not i.startswith("#")]

def generateTopics(corpus, dictionary):
    # Build LDA model using the above corpus
    lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50)
    corpus_lda = lda[corpus]

    # Group topics with similar words together.
    # This assumes show_topics() returns strings like "weight*word + ...";
    # newer gensim releases may return (topic_id, string) pairs instead.
    tops = set(lda.show_topics(50))
    top_clusters = []
    for l in tops:
        top = []
        for t in l.split(" + "):
            # Each term is "weight*word"; store it as (weight, word).
            top.append((t.split("*")[0], t.split("*")[1]))
        top_clusters.append(top)

    # Generate word only topics
    top_wordonly = []
    for i in top_clusters:
        top_wordonly.append(":".join([j[1] for j in i]))

    return lda, corpus_lda, top_clusters, top_wordonly

#######################################################################

# Read textfile, build dictionary and bag-of-words corpus
documents = []
for line in codecs.open("./europarl-mini2/map/coach.en-es.all","r","utf8"):
    # The lemmatized sentence is the fourth tab-separated field.
    lemma = line.split("\t")[3]
    documents.append(lemma)
texts = [[word for word in document.lower().split() if word not in stopwords]
         for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda, corpus_lda, topic_clusters, topic_wordonly = generateTopics(corpus, dictionary)

for i in topic_wordonly:
    print(i)

Best answer

Why do the same LDA parameters and corpus generate different topics every time?

Because LDA uses randomness in both the training and inference steps.

And how do I stabilize the topic generation?

By resetting the numpy.random seed to the same value every time a model is trained or inference is performed, using numpy.random.seed:

import numpy as np

SOME_FIXED_SEED = 42

# before training/inference:
np.random.seed(SOME_FIXED_SEED)

(This is ugly, and it makes Gensim results hard to reproduce; consider submitting a patch. I've already opened an issue.)
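To see why reseeding helps, here is a minimal sketch that uses only NumPy. The function random_topic_init is a hypothetical stand-in for a stochastic training step, not gensim's actual code; it simply draws a random topic-word matrix from NumPy's global generator, the way a model that relies on numpy.random would.

```python
import numpy as np

SOME_FIXED_SEED = 42

def random_topic_init(num_topics, vocab_size):
    """Hypothetical stand-in for a stochastic training step.

    Draws a random (num_topics x vocab_size) matrix from NumPy's
    global random generator.
    """
    return np.random.gamma(100.0, 1.0 / 100.0, (num_topics, vocab_size))

# Without reseeding, consecutive runs consume different random numbers.
unseeded_a = random_topic_init(3, 5)
unseeded_b = random_topic_init(3, 5)

# Resetting the seed to the same value before each run replays the
# same random sequence, so both runs produce the same matrix.
np.random.seed(SOME_FIXED_SEED)
seeded_a = random_topic_init(3, 5)
np.random.seed(SOME_FIXED_SEED)
seeded_b = random_topic_init(3, 5)

print(np.array_equal(unseeded_a, unseeded_b))  # expected to be False
print(np.array_equal(seeded_a, seeded_b))      # True
```

With gensim, the same np.random.seed call would go immediately before each LdaModel(...) construction and before each lda[corpus] inference. Newer gensim releases also accept a random_state argument on LdaModel, which avoids touching the global seed; check whether your installed version supports it.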

Regarding "python - LDA model generates different topics every time I train on the same corpus", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/15067734/
