gpt4 book ai didi

document-classification - 结合 LSA/LSI 与朴素贝叶斯进行文档分类

转载 作者:行者123 更新时间:2023-12-01 06:24:55 25 4
gpt4 key购买 nike

我是新来的 gensim包和向量空间模型一般,我不确定我应该如何处理我的 LSA 输出。

为了简要概述我的目标,我想使用主题建模来增强朴素贝叶斯分类器,以改进评论(正面或负面)的分类。这是一个 great paper我一直在阅读那些塑造了我的想法但让我对实现仍然有些困惑的书。

我已经得到了朴素贝叶斯的工作代码——目前,我只是使用一元词袋,因为我的特征和标签要么是正面的,要么是负面的。

这是我的 gensim 代码

from pprint import pprint # pretty printer
import gensim as gs

# tutorial sample documents
docs = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]


# stoplist removal, tokenization
stoplist = set('for a of the and to in'.split())
# for each document: lowercase document, split by whitespace, and add all its words not in stoplist to texts
texts = [[word for word in doc.lower().split() if word not in stoplist] for doc in docs]


# create dict
dict = gs.corpora.Dictionary(texts)
# create corpus
corpus = [dict.doc2bow(text) for text in texts]

# tf-idf
tfidf = gs.models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

# latent semantic indexing with 10 topics
lsi = gs.models.LsiModel(corpus_tfidf, id2word=dict, num_topics =10)

for i in lsi.print_topics():
print i

这是输出
0.400*"system" + 0.318*"survey" + 0.290*"user" + 0.274*"eps" + 0.236*"management" + 0.236*"opinion" + 0.235*"response" + 0.235*"time" + 0.224*"interface" + 0.224*"computer"
0.421*"minors" + 0.420*"graph" + 0.293*"survey" + 0.239*"trees" + 0.226*"paths" + 0.226*"intersection" + -0.204*"system" + -0.196*"eps" + 0.189*"widths" + 0.189*"quasi"
-0.318*"time" + -0.318*"response" + -0.261*"error" + -0.261*"measurement" + -0.261*"perceived" + -0.261*"relation" + 0.248*"eps" + -0.203*"opinion" + 0.195*"human" + 0.190*"testing"
0.416*"random" + 0.416*"binary" + 0.416*"generation" + 0.416*"unordered" + 0.256*"trees" + -0.225*"minors" + -0.177*"survey" + 0.161*"paths" + 0.161*"intersection" + 0.119*"error"
-0.398*"abc" + -0.398*"lab" + -0.398*"machine" + -0.398*"applications" + -0.301*"computer" + 0.242*"system" + 0.237*"eps" + 0.180*"testing" + 0.180*"engineering" + 0.166*"management"

任何建议或一般意见将不胜感激。

最佳答案

刚开始解决同样的问题,但使用 SVM,AFAIK 在训练您的模型后,您需要执行以下操作:

new_text = 'here is some document'
text_bow = dict.doc2bow(new_text)
vector = lsi[text_bow]

哪里 矢量是文档中的主题分布,长度等于您选择用于训练的主题数,在您的情况下为 10。
因此,您需要将所有文档表示为主题分布,而不是将它们提供给分类算法。

附言我知道这是一个老问题,但每次搜索时我都会在谷歌结果中看到它)

关于document-classification - 结合 LSA/LSI 与朴素贝叶斯进行文档分类,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/29932784/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com