
Python - NLTK train/test split


I've been following SentDex's video series on NLTK and Python and have built a script that determines review sentiment using various models, such as logistic regression. My concern is that SentDex's approach includes the test set when deciding which words to use as features, which is obviously undesirable (the train/test split happens after feature selection).

(Edited in response to Mohammed Kashif's comment)

Full code:

import nltk
import numpy as np
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify import ClassifierI
from nltk.corpus import movie_reviews
from sklearn.naive_bayes import MultinomialNB

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

all_words = []

for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

word_features = list(all_words.keys())[:3000]
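# Caveat: FreqDist keys come back in first-seen order, not frequency order,
# so this takes the first 3000 distinct words rather than the 3000 most
# common; all_words.most_common(3000) would give the most frequent ones.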

def find_features(documents):
    words = set(documents)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]

np.random.shuffle(featuresets)

training_set = featuresets[:1800]
testing_set = featuresets[1800:]

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) *100)

What I've already tried:

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

np.random.shuffle(documents)

training_set = documents[:1800]
testing_set = documents[1800:]

all_words = []

for w in documents.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

word_features = list(all_words.keys())[:3000]

def find_features(training_set):
    words = set(training_set)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

featuresets = [(find_features(rev), category) for (rev, category) in training_set]

np.random.shuffle(featuresets)

training_set = featuresets
testing_set = testing_set

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) *100)

which produces this error:

Traceback (most recent call last):
  File "", line 34, in 
    print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) *100)
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\classify\util.py", line 87, in accuracy
    results = classifier.classify_many([fs for (fs, l) in gold])
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\classify\scikitlearn.py", line 85, in classify_many
    X = self._vectorizer.transform(featuresets)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 291, in transform
    return self._transform(X, fitting=False)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 166, in _transform
    for f, v in six.iteritems(x):
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\externals\six.py", line 439, in iteritems
    return iter(getattr(d, _iteritems)(**kw))
AttributeError: 'list' object has no attribute 'items'

Best Answer

Okay, there are a few errors in the code. We'll go through them one by one.

First, your documents list is a list of tuples, so it has no words() method. To access all the words, change the for loop like this:

all_words = []

for words_list, categ in documents:   #<-- each words_list is a list of words
    for w in words_list:              #<-- then access each word in the list
        all_words.append(w.lower())
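If you prefer a single pass, the same flattening can be written with itertools.chain; this is a minimal sketch equivalent to the loop above:

from itertools import chain

# flatten every (word_list, category) pair into one stream of lowercased words
all_words = [w.lower() for w in chain.from_iterable(words for words, _ in documents)]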

Second, you need to create feature sets for both the training set and the test set. You built feature sets only for training_set, while testing_set still held raw (word-list, label) tuples, which is why the accuracy call fails with 'list' object has no attribute 'items'. Change the code to this:

featuresets = [(find_features(rev), category) for (rev, category) in documents]

np.random.shuffle(featuresets)

training_set = featuresets[:1800]
testing_set = featuresets[1800:]

So the final code becomes:

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

np.random.shuffle(documents)

training_set = documents[:1800]
testing_set = documents[1800:]

all_words = []

for words_list, categ in documents:
    for w in words_list:
        all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

word_features = list(all_words.keys())[:3000]

def find_features(training_set):
    words = set(training_set)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]

np.random.shuffle(featuresets)

training_set = featuresets[:1800]
testing_set = featuresets[1800:]

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) *100)

Regarding Python - NLTK train/test split, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51326704/
