python - Passing a column's values as an argument to the nltk Snowball stemmer


Reposted by: 行者123  Updated: 2023-12-01 00:43:18

Passing df['language'] works for stop-word removal but not for the Snowball stemmer. Is there a way to fix this?

So far I have not found any real leads...

import nltk
from nltk.corpus import stopwords
import pandas as pd
import re

df = pd.DataFrame([['A sentence in English', 'english'], ['En mening på svenska', 'swedish']], columns = ['text', 'language'])

def tokenize(text):
    # Raw string so that \W reaches the regex engine unchanged
    tokens = re.split(r'\W+', text)
    return tokens

def remove_stopwords(tokenized_list, language):
    stopword = nltk.corpus.stopwords.words(language)
    text = [word for word in tokenized_list if word not in stopword]
    return text

def stemming(tokenized_text, l):
    ss = nltk.stem.SnowballStemmer(l)
    text = [ss.stem(word) for word in tokenized_text]
    return text

df['text_tokenized'] = df['text'].apply(lambda x: tokenize(x.lower()))
df['text_nostop'] = df['text_tokenized'].apply(lambda x: remove_stopwords(x, df['language']))
df['text_stemmed'] = df['text_nostop'].apply(lambda x: stemming(x, df['language']))

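I expected Snowball stemming to work with English and Swedish as the language, just as stop-word removal does. Instead I get the following error message: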

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Best answer

Try this:

df['text_stemmed']=df.apply(lambda x: stemming(x['text_nostop'], x['language']), axis=1)
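To illustrate what axis=1 changes, here is a minimal sketch that uses only pandas and the sample data from the question. It compares what the lambda receives in a column-level apply with what it receives in a row-level apply. The variable names (cell_types, row_languages, captured_type) are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [["A sentence in English", "english"], ["En mening på svenska", "swedish"]],
    columns=["text", "language"],
)

# Column-level apply: the function receives one cell value at a time.
cell_types = df["text"].apply(lambda x: type(x).__name__).tolist()

# Row-level apply: with axis=1 the function receives a whole row as a Series,
# so x["language"] is the single string that belongs to that row.
row_languages = df.apply(lambda x: x["language"], axis=1).tolist()

# Referring to df["language"] inside a lambda ignores the current row:
# it is always the full column.
captured_type = type(df["language"]).__name__

print(cell_types)
print(row_languages)
print(captured_type)
```

With axis=1 both the token list and the language can be read from the same row, which is what the stemming function needs.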

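Edit: When you call apply on a specific column (for example df['text_tokenized'].apply(lambda x: ...)), the lambda function operates on x, which is each element of the text_tokenized column in turn, whereas df['language'] does not refer to any particular row but to the entire pandas Series.

In other words, when you try lambda x: remove_stopwords(x, df['language']), df['language'] does not evaluate to the 'language' value of the corresponding row; it is the pandas Series containing both 'english' and 'swedish':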

0    english
1    swedish
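The error message in the question is what pandas raises when a whole Series is used where a single true/false value is needed. The sketch below reproduces it without NLTK, on the assumption that the stemmer validates its language argument with a membership test of this kind. The tuple named supported is a stand-in, not NLTK's actual list.

```python
import pandas as pd

languages = pd.Series(["english", "swedish"], name="language")

# Stand-in for the tuple of names a stemmer might validate against.
supported = ("english", "swedish")

try:
    # Membership compares each tuple item with the whole Series; each
    # comparison yields a boolean Series, which has no single truth value.
    languages in supported
    message = None
except ValueError as exc:
    message = str(exc)

print(message)

# A single row's value is a plain string, so the same test is well defined.
print(languages.iloc[0] in supported)
```

The stop-word call probably did not fail because the corpus reader also accepts a collection of file names, in which case it would load both stop-word lists for every row. That is an inference about NLTK's behaviour, not something stated in the question.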

For the same reason, your second apply call should be changed as well:

df['text_nostop'] = df.apply(lambda x: remove_stopwords(x['text_tokenized'], x['language']), axis=1)
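One possible refinement, sketched under assumptions: a row-wise apply that constructs a SnowballStemmer on every call builds the same stemmer repeatedly when many rows share a language. The sketch below caches one stemmer per language and pairs each token list with its row's language using zip. The helper names get_stemmer and stem_tokens are hypothetical, and whitespace splitting stands in for the question's tokenizer.

```python
from functools import lru_cache

import pandas as pd
from nltk.stem import SnowballStemmer


@lru_cache(maxsize=None)
def get_stemmer(language):
    # Build each language's stemmer once and reuse it for later rows.
    return SnowballStemmer(language)


def stem_tokens(tokens, language):
    stemmer = get_stemmer(language)
    return [stemmer.stem(token) for token in tokens]


df = pd.DataFrame({
    "text": ["A sentence in English", "En mening på svenska"],
    "language": ["english", "swedish"],
})

# Simple whitespace tokenization, only to keep the sketch short.
df["tokens"] = df["text"].str.lower().str.split()

# zip pairs each token list with the language from the same row.
df["text_stemmed"] = [
    stem_tokens(tokens, language)
    for tokens, language in zip(df["tokens"], df["language"])
]

print(df[["language", "text_stemmed"]])
```

The result should match the apply(..., axis=1) version; the difference is only how often a stemmer object is created.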

Source: a similar question on Stack Overflow about python - passing a column's values as an argument to the nltk Snowball stemmer: https://stackoverflow.com/questions/57182902/
