
python - A practical example of GSDMM in Python?

Reposted. Author: 行者123. Updated: 2023-12-04 15:59:22

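I want to use GSDMM to assign topics to some tweets in my dataset. The only examples I found (1, 2) are not detailed enough. Does anyone know of source code that shows how GSDMM is used from Python (or care enough to put together a small example)?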

Best answer

I finally put together my code for GSDMM and am posting it here from start to finish for others to use. I have tried to comment the important parts:

# Turn each sentence into a list of words.
# `data` is assumed to be a list of raw text strings (one per tweet).
data_words = []
for doc in data:
    doc = doc.split()
    data_words.append(doc)
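
Since the input here is tweets, it may help to strip tweet-specific noise from each string in `data` before the split above. The following is only a minimal sketch using the standard library; `clean_tweet`, the two regular expressions, and the sample tweet are hypothetical and not part of the original answer.

```python
import re

# Hypothetical helper (not in the original answer): removes tweet-specific
# noise so that URLs and @mentions do not end up as vocabulary entries.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+:?")  # also swallows the ':' that follows "RT @user"

def clean_tweet(text):
    """Lowercase a tweet and drop URLs, @mentions and the '#' symbol."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")          # keep the hashtag word, drop the symbol
    return " ".join(text.lower().split())  # collapse runs of whitespace

sample = "RT @user: Loving the new #Python release! https://example.com/x"
print(clean_tweet(sample))
```

It would be applied before tokenizing, for example `data = [clean_tweet(t) for t in data]`. The leftover `rt` token is handled later by the stop-word list.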


# Build the bigram model
import gensim

# Phrases expects an iterable of token lists, so it is trained on data_words
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
print('done!')



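# Remove stop words
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords  # assumed source of stop_words; needs nltk.download('stopwords')

stop_words = stopwords.words('english')
stop_words.extend(['from', 'rt'])

def remove_stopwords(texts):
    # Each doc is a list of tokens; re-join it so simple_preprocess can normalize it
    return [[word for word in simple_preprocess(" ".join(doc)) if word not in stop_words] for doc in texts]

data_words_nostops = remove_stopwords(data_words)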


# Form Bigrams
data_words_bigrams = [bigram_mod[doc] for doc in data_words_nostops]



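# Lemmatization: keep only nouns, adjectives, verbs and adverbs
import spacy

# Assumed model; install it with: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

data_lemmatized = []
for sent in data_words_bigrams:
    doc = nlp(" ".join(sent))
    data_lemmatized.append([token.lemma_ for token in doc if token.pos_ in ['NOUN', 'ADJ', 'VERB', 'ADV']])

docs = data_lemmatized
vocab = set(x for doc in docs for x in doc)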

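# Train a new model
import random
import numpy as np
from gsdmm import MovieGroupProcess  # assumed to come from the rwalk/gsdmm package

random.seed(1000)
# gsdmm appears to sample through numpy.random, so seed NumPy as well
np.random.seed(1000)

# Initialize the Gibbs Sampling Dirichlet Mixture Model:
# K is the maximum number of clusters, alpha and beta are the Dirichlet priors
mgp = MovieGroupProcess(K=10, alpha=0.1, beta=0.1, n_iters=30)

n_terms = len(vocab)
n_docs = len(docs)

# Fit the model; y is expected to hold the cluster label of each document
y = mgp.fit(docs, n_terms)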

import numpy as np

def top_words(cluster_word_distribution, top_cluster, values):
    # Print the `values` most frequent words of each cluster in `top_cluster`
    for cluster in top_cluster:
        sort_dicts = sorted(cluster_word_distribution[cluster].items(), key=lambda k: k[1], reverse=True)[:values]
        print('Cluster %s : %s' % (cluster, sort_dicts))
        print('-' * 20)

doc_count = np.array(mgp.cluster_doc_count)
print('Number of documents per topic:', doc_count)
print('*' * 20)

# Topics sorted by the number of documents they are allocated to
top_index = doc_count.argsort()[-10:][::-1]
print('Most important clusters (by number of docs inside):', top_index)
print('*' * 20)

# Show the top 10 words by term frequency for each cluster
top_words(mgp.cluster_word_distribution, top_index, 10)
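
The question asks for a topic per tweet, so one more step may be useful: mapping every document to its most likely cluster. The sketch below assumes the fitted model exposes `choose_best_label(doc)` returning a `(cluster, probability)` pair, which is my understanding of `MovieGroupProcess` in the gsdmm package. The helper name `assign_topics` and the `min_prob` cutoff are hypothetical.

```python
def assign_topics(model, docs, min_prob=0.3):
    """Return one (cluster, probability) pair per tokenized document.

    `model` is assumed to provide choose_best_label(doc) -> (cluster, prob).
    Documents whose best probability is below `min_prob` (a hypothetical
    cutoff) get the label -1, meaning "no confident topic".
    """
    results = []
    for doc in docs:
        cluster, prob = model.choose_best_label(doc)
        if prob < min_prob:
            cluster = -1
        results.append((int(cluster), float(prob)))
    return results

# Usage with the objects defined above (not executed here):
# labels = assign_topics(mgp, docs)
# for tweet, (cluster, prob) in zip(data, labels):
#     print(cluster, round(prob, 2), tweet)
```

The cutoff is a design choice: short tweets often score almost equally across several clusters, and flagging them separately can be easier to inspect than forcing a label.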



Hope this helps!
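
For readers who want to see what the sampler does internally, here is a simplified pure-Python sketch of collapsed Gibbs sampling for a Dirichlet multinomial mixture, the idea behind GSDMM. It is not the gsdmm package's code: the function name `gsdmm`, its return values, and the toy documents are my own assumptions. Each document is repeatedly removed from its cluster and reassigned with probability proportional to (cluster size + alpha) times how well the cluster's word counts match the document.

```python
import math
import random
from collections import Counter

def gsdmm(docs, K=8, alpha=0.1, beta=0.1, n_iters=30, seed=0):
    """Cluster tokenized docs; returns (labels, per-cluster word counts)."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})   # vocabulary size
    D = len(docs)
    m = [0] * K                             # documents per cluster
    n = [0] * K                             # total words per cluster
    nw = [Counter() for _ in range(K)]      # word counts per cluster
    z = []                                  # current cluster of each document

    def move(doc, k, sign):
        # Add (sign=+1) or remove (sign=-1) a document from cluster k
        m[k] += sign
        n[k] += sign * len(doc)
        for w in doc:
            nw[k][w] += sign

    for doc in docs:                        # random initial assignment
        k = rng.randrange(K)
        z.append(k)
        move(doc, k, +1)

    for _ in range(n_iters):
        for i, doc in enumerate(docs):
            move(doc, z[i], -1)             # take the document out
            logp = []
            for k in range(K):
                lp = math.log(m[k] + alpha) - math.log(D - 1 + K * alpha)
                seen = Counter()            # occurrences of each word so far in doc
                for w in doc:
                    lp += math.log(nw[k][w] + beta + seen[w])
                    seen[w] += 1
                for j in range(len(doc)):
                    lp -= math.log(n[k] + V * beta + j)
                logp.append(lp)
            top = max(logp)                 # subtract the max for numerical stability
            weights = [math.exp(lp - top) for lp in logp]
            z[i] = rng.choices(range(K), weights=weights)[0]
            move(doc, z[i], +1)             # put it into the sampled cluster
    return z, nw

toy_docs = [["cat", "dog", "pet"], ["dog", "pet", "cat"], ["cat", "pet"],
            ["stock", "market", "trade"], ["market", "trade"], ["stock", "trade", "market"]]
labels, counts = gsdmm(toy_docs, K=4, n_iters=20, seed=1)
print(labels)
```

Because the assignment is sampled, the labels depend on the seed; clusters that end up empty simply stay unused, which is how the algorithm settles on fewer than K topics.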

Regarding "python - A practical example of GSDMM in Python?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/62108771/
