So, I am new to Python and NLTK. I have a file called reviews.csv which contains reviews pulled from Amazon. I have tokenized the contents of this csv file and written them to a file called csvfile.csv. Here is the code:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer
import csv
from nltk.corpus import stopwords

ps = PorterStemmer()
stop_words = set(stopwords.words("english"))

with open('reviews.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter='.')
    for lines in readCSV:
        word1 = word_tokenize(str(lines))
        print(word1)
        with open('csvfile.csv', 'a') as file:
            for word in word1:
                file.write(word)
                file.write('\n')

with open('csvfile.csv') as csvfile:
    readCSV1 = csv.reader(csvfile)
    for w in readCSV1:
        if w not in stopwords:
            print(w)
I am trying to perform stemming on csvfile.csv, but I get this error:
Traceback (most recent call last):
  File "/home/aarushi/test.py", line 25, in <module>
    if w not in stopwords:
TypeError: argument of type 'WordListCorpusReader' is not iterable
When you do

from nltk.corpus import stopwords

stopwords is a variable that points to a CorpusReader object in nltk.
The actual stop words you are looking for (i.e. the list of stop words) are only instantiated when you do:
stop_words = set(stopwords.words("english"))
So, when checking whether a word from your list of tokens is a stop word, you should do this:
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

for w in tokenized_sent:
    if w not in stop_words:
        pass  # Do something.
To avoid confusion, I usually name the actual list of stop words stoplist:
from nltk.corpus import stopwords

stoplist = set(stopwords.words("english"))

for w in tokenized_sent:
    if w not in stoplist:
        pass  # Do something.