
python - Nltk: Removing stop words from a list of lists

Reposted. Author: 行者123. Updated: 2023-11-28 22:23:38

I am trying to remove stop words and tried the following:

from nltk.tokenize import RegexpTokenizer

# Split each row of the column into word tokens
tokenizer = RegexpTokenizer(r'\w+')
tokenized = data['data_column'].apply(tokenizer.tokenize)
tokenized

After tokenization, the output is as follows:

0    [ANOTHER,SAMPLE,AS,OUTPUT,MSG...
1 [A,SAMPLE,TEXT,FOR,ILLUSTRATION...
Name: data_column, dtype: object

I tried to remove the stop words with the following:

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

filtered_sentence = [w for w in tokenized if not w in stop_words]

filtered_sentence = []
for w in tokenized:
    if w not in stop_words:
        filtered_sentence.append(w)

I get this error:

---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-272-d4a699384ffc> in <module>()
2 stop_words = set(stopwords.words('english'))
3
----> 4 filtered_sentence = [w for w in tokenized if not w in stop_words]
5
6 filtered_sentence = []

TypeError: unhashable type: 'list'

Best answer

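You need .apply() to filter each list inside the Series: iterating over the Series directly yields whole lists, and a list cannot be looked up in a set because it is unhashable. Since the stop word corpus contains lowercase words, you also need to call .lower() on each token before the lookup, i.e.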

stop_words = set(stopwords.words('english'))
filtered_sentence = tokenized.apply(lambda x : [w for w in x if w.lower() not in stop_words])
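
To see why the original comprehension fails, here is a minimal sketch in plain Python. It assumes a tiny stand-in stop word set (the name `stop_words` is reused here, but its three entries are only an illustrative subset, not the NLTK list) and two plain lists in place of the rows of the tokenized Series.

```python
# Hypothetical stand-in for set(stopwords.words('english')).
stop_words = {"a", "as", "for"}

# Each row of the tokenized Series is itself a list of tokens.
rows = [
    ["ANOTHER", "SAMPLE", "AS", "OUTPUT", "MSG"],
    ["A", "SAMPLE", "TEXT", "FOR", "ILLUSTRATION"],
]

# Membership in a set requires hashing the item; a list is not hashable,
# so testing a whole row against the set raises TypeError.
try:
    rows[0] in stop_words
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Testing each token (a str) works once its case is normalised.
first_row_filtered = [w for w in rows[0] if w.lower() not in stop_words]
print(first_row_filtered)
```

The lookup has to happen one level deeper, on the words inside each row, which is what the lambda passed to .apply() does.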

Sample run

import pandas as pd
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

df = pd.DataFrame({'words': [['A','SAMPLE','AS','OUTPUT','MSG']]})
df['words'].apply(lambda x: [i for i in x if i.lower() not in stop])

0 [SAMPLE, OUTPUT, MSG]
Name: words, dtype: object
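
If the data is an ordinary list of lists rather than a pandas Series, the same filter can be sketched as a nested comprehension. The helper name `remove_stop_words` and the three-word stop set below are assumptions made for illustration; in practice the set would come from stopwords.words('english').

```python
from typing import Iterable, List, Set


def remove_stop_words(rows: Iterable[List[str]], stop_words: Set[str]) -> List[List[str]]:
    """Return every token list with stop words removed, ignoring case."""
    return [
        [tok for tok in row if tok.lower() not in stop_words]
        for row in rows
    ]


# Hypothetical subset standing in for the NLTK English stop word list.
stop = {"a", "as", "for"}

rows = [
    ["ANOTHER", "SAMPLE", "AS", "OUTPUT", "MSG"],
    ["A", "SAMPLE", "TEXT", "FOR", "ILLUSTRATION"],
]

filtered = remove_stop_words(rows, stop)
print(filtered)
```

The outer comprehension walks the rows and the inner one walks the tokens, so only strings are ever tested against the set. The original casing of the kept tokens is preserved because .lower() is used only for the lookup.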

Regarding python - Nltk: removing stop words from a list of lists, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46914387/
