Below is code that trains a Naive Bayes Classifier on the movie_reviews dataset using a unigram model. I want to train it with bigram and trigram models as well and compare their performance. How can this be done?
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def create_word_features(words):
    useful_words = [word for word in words if word not in stopwords.words("english")]
    my_dict = dict([(word, True) for word in useful_words])
    return my_dict

pos_data = []
for fileid in movie_reviews.fileids('pos'):
    words = movie_reviews.words(fileid)
    pos_data.append((create_word_features(words), "positive"))

neg_data = []
for fileid in movie_reviews.fileids('neg'):
    words = movie_reviews.words(fileid)
    neg_data.append((create_word_features(words), "negative"))

train_set = pos_data[:800] + neg_data[:800]
test_set = pos_data[800:] + neg_data[800:]
classifier = NaiveBayesClassifier.train(train_set)
accuracy = nltk.classify.util.accuracy(classifier, test_set)
Best Answer
Change the featurizer so it extracts n-grams instead of single words:

from nltk import ngrams

def create_ngram_features(words, n=2):
    # Each n-gram tuple becomes a boolean feature key.
    ngram_vocab = ngrams(words, n)
    my_dict = dict([(ng, True) for ng in ngram_vocab])
    return my_dict
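For illustration, ngrams simply yields tuples of n consecutive tokens, which then serve as the feature keys:

>>> from nltk import ngrams
>>> list(ngrams(['the', 'movie', 'was', 'great'], 2))
[('the', 'movie'), ('movie', 'was'), ('was', 'great')]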
By the way, your code would be a lot faster if you changed the featurizer to use a set for the stopword list and initialized it only once:
stoplist = set(stopwords.words("english"))

def create_word_features(words):
    # Membership tests against a set are O(1), vs. a linear scan of a list.
    useful_words = [word for word in words if word not in stoplist]
    my_dict = dict([(word, True) for word in useful_words])
    return my_dict
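A minimal sketch of why this matters (an illustrative benchmark, not part of the original answer), comparing membership tests with timeit:

import timeit

# Looking up a word that is not a stopword forces a full scan of the list,
# while the set answers in constant time.
list_setup = "from nltk.corpus import stopwords; stoplist = stopwords.words('english')"
set_setup = "from nltk.corpus import stopwords; stoplist = set(stopwords.words('english'))"

print(timeit.timeit("'movie' in stoplist", setup=list_setup, number=100000))
print(timeit.timeit("'movie' in stoplist", setup=set_setup, number=100000))
# Expect the set version to be orders of magnitude faster.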
Someone really should tell the NLTK folks to convert the stopword list to a set type, since it is "technically" a unique list (i.e. a set):
>>> from nltk.corpus import stopwords
>>> type(stopwords.words('english'))
<class 'list'>
>>> type(set(stopwords.words('english')))
<class 'set'>
Putting it all together, train and evaluate a classifier for each n:

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk import ngrams

def create_ngram_features(words, n=2):
    ngram_vocab = ngrams(words, n)
    my_dict = dict([(ng, True) for ng in ngram_vocab])
    return my_dict

for n in [1, 2, 3, 4, 5]:
    pos_data = []
    for fileid in movie_reviews.fileids('pos'):
        words = movie_reviews.words(fileid)
        pos_data.append((create_ngram_features(words, n), "positive"))

    neg_data = []
    for fileid in movie_reviews.fileids('neg'):
        words = movie_reviews.words(fileid)
        neg_data.append((create_ngram_features(words, n), "negative"))

    # Same 800/200 train-test split per class as before.
    train_set = pos_data[:800] + neg_data[:800]
    test_set = pos_data[800:] + neg_data[800:]
    classifier = NaiveBayesClassifier.train(train_set)
    accuracy = nltk.classify.util.accuracy(classifier, test_set)
    print(str(n) + '-gram accuracy:', accuracy)
[Output]:
1-gram accuracy: 0.735
2-gram accuracy: 0.7625
3-gram accuracy: 0.8275
4-gram accuracy: 0.8125
5-gram accuracy: 0.74
Your original code returns an accuracy of 0.725. To combine features of every order from unigrams up to n-grams, use NLTK's everygrams, which yields all n-grams of length 1 through n:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk import everygrams

def create_ngram_features(words, n=2):
    # everygrams(words, 1, n) yields every 1-gram, 2-gram, ..., n-gram.
    ngram_vocab = everygrams(words, 1, n)
    my_dict = dict([(ng, True) for ng in ngram_vocab])
    return my_dict

for n in range(1, 6):
    pos_data = []
    for fileid in movie_reviews.fileids('pos'):
        words = movie_reviews.words(fileid)
        pos_data.append((create_ngram_features(words, n), "positive"))

    neg_data = []
    for fileid in movie_reviews.fileids('neg'):
        words = movie_reviews.words(fileid)
        neg_data.append((create_ngram_features(words, n), "negative"))

    train_set = pos_data[:800] + neg_data[:800]
    test_set = pos_data[800:] + neg_data[800:]
    classifier = NaiveBayesClassifier.train(train_set)
    accuracy = nltk.classify.util.accuracy(classifier, test_set)
    print('1-gram to', str(n) + '-gram accuracy:', accuracy)
[Output]:
1-gram to 1-gram accuracy: 0.735
1-gram to 2-gram accuracy: 0.7625
1-gram to 3-gram accuracy: 0.7875
1-gram to 4-gram accuracy: 0.8
1-gram to 5-gram accuracy: 0.82
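As a quick sanity check (my addition, sorted because the yield order can differ across NLTK versions), everygrams produces every n-gram of length min_len through max_len:

>>> from nltk import everygrams
>>> sorted(everygrams(['a', 'b', 'c'], 1, 2))
[('a',), ('a', 'b'), ('b',), ('b', 'c'), ('c',)]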
A similar question on python - how to train a Naive Bayes classifier for n-grams (movie_reviews) can be found on Stack Overflow: https://stackoverflow.com/questions/48003907/