python - How to return the dominant topic, percentage contribution, and topic keywords to the original model

Reposted by: 行者123 Updated: 2023-12-04 11:02:00

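There are many examples of LDA Mallet topic modeling, but none of them shows how to add the dominant topic, percentage contribution, and topic keywords to the original dataframe.
Let's assume this is the dataset and my code.

Dataset: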

Document_Id   Text
1             'Here goes one example sentence that is generic'
2             'My car drives really fast and I have no brakes'
3             'Your car is slow and needs no brakes'
4             'Your and my vehicle are both not as fast as the airplane'

Code

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import pandas as pd
df = pd.read_csv('data_above.csv')
data = df.Text.values.tolist() 
# Assuming I have done all the preprocessing, lemmatization and so on and ended up with data_lemmatized:

# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
model = gensim.models.ldamodel.LdaModel(corpus=corpus,
        id2word=id2word,
        num_topics=50, random_state=100,
        chunksize=1000, update_every=1,
        passes=10, alpha='auto', per_word_topics=True)

I tried something like this, but it does not work...

def format_topics_sentences(ldamodel, corpus, df):
    # Init output
    sent_topics_df = pd.DataFrame()
    # Get main topic in each document
    for i, row in enumerate(ldamodel[corpus]):
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic,4), topic_keywords]), ignore_index=True)
            else:
                break
    sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']
    # Add original text to the end of the output
    contents = df
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return(sent_topics_df)
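
Two things may explain why the attempt above fails, though this is an assumption since no error message is given. First, when a gensim `LdaModel` is built with `per_word_topics=True`, each item of `ldamodel[corpus]` is a tuple whose first element is the list of `(topic_id, probability)` pairs, so sorting the item directly does not sort the topics. Second, `DataFrame.append` was removed in pandas 2.0. A minimal sketch that works around both follows; the `FakeLda` class and its numbers are hypothetical and exist only so the snippet runs without training a model.

```python
import pandas as pd


def format_topics_sentences(ldamodel, corpus, df):
    """Return df plus the dominant topic, its share, and its keywords per document."""
    rows = []
    for doc_topics in ldamodel[corpus]:
        # With per_word_topics=True the item is a tuple; its first element
        # holds the (topic_id, probability) pairs we need.
        if isinstance(doc_topics, tuple):
            doc_topics = doc_topics[0]
        if not doc_topics:
            # No topic passed the probability threshold for this document.
            rows.append((None, None, ""))
            continue
        # The highest-probability topic is the dominant one.
        topic_num, prop_topic = max(doc_topics, key=lambda pair: pair[1])
        keywords = ", ".join(word for word, _ in ldamodel.show_topic(topic_num))
        rows.append((int(topic_num), round(float(prop_topic), 4), keywords))
    topics_df = pd.DataFrame(
        rows, columns=["Dominant_Topic", "Perc_Contribution", "Topic_Keywords"]
    )
    # Align by position so the new columns line up with the original rows.
    return pd.concat([df.reset_index(drop=True), topics_df], axis=1)


class FakeLda:
    """Hypothetical stand-in for a trained model (illustration only)."""

    def __getitem__(self, corpus):
        # Mimics per_word_topics=True: (doc topics, word topics, phi values)
        return [([(0, 0.2), (1, 0.8)], [], []), ([(0, 0.7), (1, 0.3)], [], [])]

    def show_topic(self, topic_id):
        topics = {
            0: [("car", 0.5), ("brake", 0.3)],
            1: [("airplane", 0.6), ("fast", 0.2)],
        }
        return topics[topic_id]


df = pd.DataFrame({"Document_Id": [1, 2], "Text": ["first text", "second text"]})
result = format_topics_sentences(FakeLda(), corpus=None, df=df)
```

With a real model, the call would be `format_topics_sentences(model, corpus, df)`; the returned frame keeps `Document_Id` and `Text` and gains the three topic columns.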

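Best answer

I used this code in my project as well. It gives you the topic keywords and the dominant topic of each document.
To get each document's percentage contribution for every topic, you can use: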

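# Collect the topic distribution of every document from the trained LDA Mallet model
topics_docs = list(ldamallet[corpus])

# Map each Document_Id to its list of topic probabilities
topics_docs_dict = dict()
for i in range(len(df)):
    topics_docs_dict[df.loc[i]["Document_Id"]] = [prob for (topic, prob) in topics_docs[i]]

# Columns are documents here; transpose so rows are documents and columns are topics
topics_docs_df = pd.DataFrame(data=topics_docs_dict)
docs_topics_df = topics_docs_df.transpose()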

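With the code above, docs_topics_df has the documents as rows and the topics as columns, and each cell holds the percentage contribution.
** My code works, but it may not be the most efficient solution. If you can improve it or offer another solution, please edit my code.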
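
One possible variation, offered as a sketch rather than a definitive improvement: build the document-by-topic table from one dictionary per document. This also copes with models that return only some topics per document (a plain gensim `LdaModel` drops topics with very low probability), where per-document lists of unequal length would not fit into a single dataframe. The function name `doc_topic_matrix` and the sample values are hypothetical.

```python
import pandas as pd


def doc_topic_matrix(doc_topics_iter, doc_ids):
    """Build a documents x topics table of topic probabilities."""
    rows = []
    for doc_topics in doc_topics_iter:
        # Unwrap the tuple produced when per_word_topics=True.
        if isinstance(doc_topics, tuple):
            doc_topics = doc_topics[0]
        rows.append({int(topic): float(prob) for topic, prob in doc_topics})
    # Topics missing from a document get a contribution of 0.
    matrix = pd.DataFrame(rows, index=list(doc_ids)).fillna(0.0)
    return matrix.sort_index(axis=1)


# Hypothetical model output for two documents and three topics.
sample_output = [[(0, 0.9), (2, 0.1)], [(1, 1.0)]]
docs_topics_df = doc_topic_matrix(sample_output, doc_ids=[1, 2])

# The dominant topic of each document is the column with the largest value.
dominant = docs_topics_df.idxmax(axis=1)
```

With real data, the call would be something like `doc_topic_matrix(ldamallet[corpus], df["Document_Id"])`, and the result can be joined to the original dataframe on `Document_Id`.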

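Regarding python - How to return the dominant topic, percentage contribution, and topic keywords to the original model, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58749620/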
