
Python: removing stop words from a pandas DataFrame

Reposted. Author: IT老高. Updated: 2023-10-28 22:17:17

I want to remove stop words from my "tweet" column. How do I iterate over every row and every item?

import pandas as pd

pos_tweets = [('I love this car', 'positive'),
              ('This view is amazing', 'positive'),
              ('I feel great this morning', 'positive'),
              ('I am so excited about the concert', 'positive'),
              ('He is my best friend', 'positive')]

test = pd.DataFrame(pos_tweets)
test.columns = ["tweet","class"]
test["tweet"] = test["tweet"].str.lower().str.split()

from nltk.corpus import stopwords
stop = stopwords.words('english')

Best answer

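We can import stopwords from nltk.corpus as shown below, then exclude the stop words with a Python list comprehension and pandas.Series.apply. The comparison is case-sensitive and the NLTK list is all lowercase, so capitalized words such as "I", "This" and "He" are kept in the output.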

import pandas as pd

# Import stopwords with nltk.
from nltk.corpus import stopwords
stop = stopwords.words('english')

pos_tweets = [('I love this car', 'positive'),
              ('This view is amazing', 'positive'),
              ('I feel great this morning', 'positive'),
              ('I am so excited about the concert', 'positive'),
              ('He is my best friend', 'positive')]

test = pd.DataFrame(pos_tweets)
test.columns = ["tweet","class"]

# Exclude stopwords with a list comprehension and pandas.Series.apply.
test['tweet_without_stopwords'] = test['tweet'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in stop])
)
print(test)
# Out[40]:
#                                tweet     class tweet_without_stopwords
# 0                    I love this car  positive              I love car
# 1               This view is amazing  positive       This view amazing
# 2          I feel great this morning  positive    I feel great morning
# 3  I am so excited about the concert  positive        I excited concert
# 4               He is my best friend  positive          He best friend
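
The question's own code lowercases the tweets and splits them into lists before filtering, so each cell holds a list of tokens rather than a string. A minimal sketch of the same list-comprehension idea for that case follows. The short `stop` set and the `tokens` column names are stand-ins chosen for this example so that it runs without NLTK data; in practice you would use `stopwords.words('english')`.

```python
import pandas as pd

# Tiny illustrative stop list (an assumption for this sketch, not the NLTK list).
stop = {"i", "this", "is", "am", "so", "about", "the", "he", "my"}

test = pd.DataFrame(
    [("I love this car", "positive"), ("He is my best friend", "positive")],
    columns=["tweet", "class"],
)

# As in the question: lowercase, then split each tweet into a list of tokens.
test["tokens"] = test["tweet"].str.lower().str.split()

# Filter each row's token list; no join is needed because the cell stays a list.
test["tokens_without_stopwords"] = test["tokens"].apply(
    lambda tokens: [word for word in tokens if word not in stop]
)
print(test["tokens_without_stopwords"].tolist())
# [['love', 'car'], ['best', 'friend']]
```

Because the tweets are lowercased first, "I" and "He" are removed here, unlike in the case-sensitive output above. Converting the stop list to a set also makes each membership test faster than scanning a list.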

Stop words can also be excluded with pandas.Series.str.replace.

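# Build one alternation pattern that matches any stop word as a whole word.
pat = r'\b(?:{})\b'.format('|'.join(stop))
# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching.
test['tweet_without_stopwords'] = test['tweet'].str.replace(pat, '', regex=True)
# Collapse the runs of whitespace left behind by the removals.
test['tweet_without_stopwords'] = test['tweet_without_stopwords'].str.replace(r'\s+', ' ', regex=True)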
# Same results.
# 0 I love car
# 1 This view amazing
# 2 I feel great morning
# 3 I excited concert
# 4 He best friend
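
A slightly more defensive variant of the regex approach could look like the sketch below. It escapes each stop word with `re.escape`, matches case-insensitively, and strips the leading or trailing space that remains when a stop word sits at the edge of a tweet. The short `stop` list is again a stand-in for the NLTK list, and `cleaned` is a name chosen for this example.

```python
import re

import pandas as pd

# Tiny illustrative stop list (an assumption for this sketch, not the NLTK list).
stop = ["i", "this", "is", "am", "so", "about", "the", "he", "my"]

test = pd.DataFrame({"tweet": ["I love this car", "This view is amazing"]})

# Escape each word so that any regex metacharacters are matched literally.
pat = r"\b(?:{})\b".format("|".join(re.escape(word) for word in stop))

cleaned = (
    test["tweet"]
    .str.replace(pat, "", flags=re.IGNORECASE, regex=True)  # drop stop words
    .str.replace(r"\s+", " ", regex=True)                   # collapse whitespace
    .str.strip()                                            # trim the edges
)
print(cleaned.tolist())
# ['love car', 'view amazing']
```

With `re.IGNORECASE`, "I" and "This" are removed as well, so the result differs from the case-sensitive output shown above.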

If the stopwords corpus cannot be loaded, download it as follows.

import nltk
nltk.download('stopwords')

Another option is to import text.ENGLISH_STOP_WORDS from sklearn.feature_extraction.

# Import stopwords with scikit-learn
from sklearn.feature_extraction import text
stop = text.ENGLISH_STOP_WORDS

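Note that the scikit-learn and NLTK stop word lists contain different numbers of words, so the two can give different results on the same text.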
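
To see how two stop word lists differ, set operations are enough. The sketch below uses a hypothetical helper, `compare_stop_lists`, and two short made-up lists so that it runs without either library installed; the real lists are not reproduced here.

```python
def compare_stop_lists(a, b):
    """Return (only_in_a, only_in_b, shared) as sorted lists."""
    a, b = set(a), set(b)
    return sorted(a - b), sorted(b - a), sorted(a & b)


# Illustrative stand-ins, not the actual NLTK or scikit-learn lists.
list_a = ["i", "me", "my", "don't"]
list_b = ["i", "me", "my", "whereas"]

only_a, only_b, shared = compare_stop_lists(list_a, list_b)
print(only_a)  # ["don't"]
print(only_b)  # ['whereas']
print(shared)  # ['i', 'me', 'my']
```

With both libraries available, the same helper can be called as `compare_stop_lists(stopwords.words('english'), text.ENGLISH_STOP_WORDS)`.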

This post on removing stop words from a pandas DataFrame in Python is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/29523254/
