
python - Training a tweet corpus for sentiment analysis using NLTK for Python

Reposted. Author: 太空宇宙. Updated: 2023-11-03 17:45:55

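I am trying to train my own corpus for sentiment analysis using NLTK for Python. I have two text files: one contains 25K positive tweets, one tweet per line, and the other contains 25K negative tweets.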

I am following method 2 from this Stack Overflow article.

When I run this code to create the corpus:

import string
from itertools import chain

from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
from nltk.corpus import CategorizedPlaintextCorpusReader
import nltk

mydir = 'C:\Users\gerbuiker\Desktop\Sentiment Analyse\my_movie_reviews'

mr = CategorizedPlaintextCorpusReader(mydir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]

numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]

classifier = nbc.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)

I get this error message:

C:\Users\gerbuiker\Anaconda\python.exe "C:/Users/gerbuiker/Desktop/Sentiment Analyse/CORPUS_POS_NEG/CreateCorpus.py"
Traceback (most recent call last):
File "C:/Users/gerbuiker/Desktop/Sentiment Analyse/CORPUS_POS_NEG/CreateCorpus.py", line 23, in <module>
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
File "C:\Users\gerbuiker\AppData\Roaming\Python\Python27\site-packages\nltk\corpus\reader\util.py", line 336, in iterate_from
assert self._len is not None
AssertionError

Process finished with exit code 1

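Does anyone know how to fix this?

Best answer

I'm not 100% sure, since I'm not on a Windows machine to test this right now, but I think what may be tripping you up is the difference in path slash direction between @alvas's original example and your adaptation of it for Windows.

Specifically, you use 'C:\Users\gerbuiker\Desktop\Sentiment Analyse\my_movie_reviews', whereas his example uses '/home/alvas/my_movie_reviews'. For the most part that is fine, but you are reusing his cat_pattern regex, r'(neg|pos)/.*', which matches the forward slashes in his paths but will reject the backslashes in yours.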
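
To make the regex point concrete, here is a minimal sketch using only Python's re module. It does not confirm which separator NLTK actually passes to cat_pattern on Windows; it only shows how the pattern behaves on each kind of string. The helper category_of, the file ids, and the CAT_PATTERN_ANY_SEP variant are made up for illustration and are not part of the original example.

```python
import re

# The category pattern reused from the original example (forward slash only).
CAT_PATTERN = r'(neg|pos)/.*'
# Hypothetical variant that accepts either separator (an assumption, not from the answer).
CAT_PATTERN_ANY_SEP = r'(neg|pos)[/\\].*'


def category_of(fileid, pattern):
    """Return the captured category for fileid, or None if the pattern does not match."""
    match = re.match(pattern, fileid)
    return match.group(1) if match else None


# Made-up file ids, for illustration only.
posix_style = 'pos/tweets.txt'
windows_style = 'pos\\tweets.txt'

print(category_of(posix_style, CAT_PATTERN))            # matches, category is 'pos'
print(category_of(windows_style, CAT_PATTERN))          # no match, returns None
print(category_of(windows_style, CAT_PATTERN_ANY_SEP))  # matches, category is 'pos'
```

If the file ids on your machine do turn out to contain backslashes, a pattern along the lines of the second one would be one way to accept both forms. Printing mr.fileids() and mr.categories() before building documents is a quick way to check what the reader actually sees.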

Regarding python - Training a tweet corpus for sentiment analysis using NLTK for Python, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/29800109/
