
python - How to generate the term matrix in Guided LDA for topic modeling?

Reposted · Author: 行者123 · Updated: 2023-12-05 07:37:28

I am currently analyzing online reviews. I want to try GuidedLDA ( https://medium.freecodecamp.org/how-we-changed-unsupervised-lda-to-semi-supervised-guidedlda-e36a95f3a164 ) because some of my topics overlap. I have installed the package successfully, but I am not sure how to generate the document-term matrix (called X in the article's code) and the vocab, using an Excel document as input. Can someone help? I have searched various forums online but found nothing that works.

Best answer

An excerpt of the TDM class from the textmining package:

import re
import csv
import os
# import stemmer  # optional, used by textmining for stemming support

You can save the class below as a separate Python file and import it into your code as a regular module, e.g. create_tdm.py:

import create_tdm

tdm = create_tdm.TermDocumentMatrix()
tdm.add_doc("your text")          # call once per review/document
rows = list(tdm.rows(cutoff=1))   # the first row yielded is the vocabulary

'''vocab'''
# Build the vocabulary from the header row. (Note: enumerating the raw
# string, as in dict((v, idx) for idx, v in enumerate("your text")),
# would enumerate characters rather than words.)
vocab = rows[0]
word2id = dict((w, idx) for idx, w in enumerate(vocab))

'''
Make sure every guided (seed) word appears in your text, otherwise you
will get a KeyError. A quick way to inspect the vocabulary:
import pandas as pd
c = pd.DataFrame(list(word2id))
'''
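To make that KeyError warning concrete, here is a minimal self-contained check; `seed_topic_list` and `vocab` below are made-up placeholders, not values from the question:

```python
# Sketch: verifying every guided (seed) word exists in the vocabulary
# before building seed_topics, to avoid the KeyError mentioned above.
# seed_topic_list and vocab are made-up placeholders.
seed_topic_list = [["shipping", "delivery"], ["price", "cost"]]
vocab = ["great", "shipping", "price", "product"]
word2id = dict((w, idx) for idx, w in enumerate(vocab))

missing = [w for words in seed_topic_list for w in words
           if w not in word2id]
print(missing)  # these seed words would raise KeyError if looked up
```

Any words reported as missing should be dropped from the seed lists or the cutoff lowered so they survive into the vocabulary.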

def simple_tokenize(document):
    # Simplified stand-in for textmining.simple_tokenize: lowercase the
    # document and split it on whitespace.
    return document.lower().split()

class TermDocumentMatrix(object):

    """
    Class to efficiently create a term-document matrix.

    The only initialization parameter is a tokenizer function, which should
    take in a single string representing a document and return a list of
    strings representing the tokens in the document. If the tokenizer
    parameter is omitted it defaults to using textmining.simple_tokenize

    Use the add_doc method to add a document (document is a string). Use the
    write_csv method to output the current term-document matrix to a csv
    file. You can use the rows method to return the rows of the matrix if
    you wish to access the individual elements without writing directly to a
    file.

    """

    def __init__(self, tokenizer=simple_tokenize):
        """Initialize with tokenizer to split documents into words."""
        # Set tokenizer to use for tokenizing new documents
        self.tokenize = tokenizer
        # The term document matrix is a sparse matrix represented as a
        # list of dictionaries. Each dictionary contains the word
        # counts for a document.
        self.sparse = []
        # Keep track of the number of documents containing the word.
        self.doc_count = {}

    def add_doc(self, document):
        """Add document to the term-document matrix."""
        # Split document up into list of strings
        words = self.tokenize(document)
        # Count word frequencies in this document
        word_counts = {}
        for word in words:
            word_counts[word] = word_counts.get(word, 0) + 1
        # Add word counts as new row to sparse matrix
        self.sparse.append(word_counts)
        # Add to total document count for each word
        for word in word_counts:
            self.doc_count[word] = self.doc_count.get(word, 0) + 1

    def rows(self, cutoff=2):
        """Helper function that returns rows of term-document matrix."""
        # Get master list of words that meet or exceed the cutoff frequency
        words = [word for word in self.doc_count
                 if self.doc_count[word] >= cutoff]
        # Return header
        yield words
        # Loop over rows
        for row in self.sparse:
            # Get word counts for all words in master list. If a word does
            # not appear in this document it gets a count of 0.
            data = [row.get(word, 0) for word in words]
            yield data

    def write_csv(self, filename, cutoff=2):
        """
        Write term-document matrix to a CSV file.

        filename is the name of the output file (e.g. 'mymatrix.csv').
        cutoff is an integer that specifies only words which appear in
        'cutoff' or more documents should be written out as columns in
        the matrix.

        """
        # Open in text mode with newline='' -- the original 'wb' mode is a
        # Python 2 idiom and fails with Python 3's csv module.
        with open(filename, 'w', newline='') as out:
            writer = csv.writer(out)
            for row in self.rows(cutoff=cutoff):
                writer.writerow(row)
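Putting the pieces together, here is a minimal end-to-end sketch of building the X count matrix and vocab that GuidedLDA expects. The two sample documents are placeholders (in practice they would come from the Excel file of reviews, e.g. via pandas.read_excel), and the GuidedLDA calls at the end are shown only as comments following the API described in the linked article:

```python
# Sketch: building the X document-term matrix and vocab for GuidedLDA.
# The sample documents below are placeholders for the Excel reviews.
import re
import numpy as np

def tokenize(document):
    # Lowercase and split on non-alphanumeric characters.
    return [w for w in re.split(r'\W+', document.lower()) if w]

docs = ["Great product, fast shipping", "Slow shipping but great price"]

# Per-document word counts (the same sparse representation as the class above).
sparse = []
for doc in docs:
    counts = {}
    for word in tokenize(doc):
        counts[word] = counts.get(word, 0) + 1
    sparse.append(counts)

# vocab: one column per distinct word; word2id maps word -> column index.
vocab = sorted({w for counts in sparse for w in counts})
word2id = dict((w, idx) for idx, w in enumerate(vocab))

# X: dense document-term count matrix (GuidedLDA expects integer counts).
X = np.zeros((len(docs), len(vocab)), dtype=np.int64)
for i, counts in enumerate(sparse):
    for word, n in counts.items():
        X[i, word2id[word]] = n

# With X and word2id in hand, GuidedLDA can then be seeded, roughly:
# import guidedlda
# model = guidedlda.GuidedLDA(n_topics=5, n_iter=100, random_state=7)
# seed_topics = {word2id[w]: t for t, words in enumerate(seed_topic_list)
#                for w in words if w in word2id}
# model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)
```

Each row of X corresponds to one review and each column to one word in vocab, which is exactly the shape the article's example feeds into GuidedLDA.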

Regarding "python - How to generate the term matrix in Guided LDA for topic modeling?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48594449/
