
python - WordNetLemmatizer error - all letters are lemmatized

Reposted. Author: 太空宇宙. Updated: 2023-11-03 15:04:54

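I am trying to lemmatize my dataset for sentiment analysis. What should I do to get the expected output instead of the current output? The input file is a CSV, loaded as a DataFrame object.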

import pandas as pd
dataset = pd.read_csv('xyz.csv')

Here is my code:

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
list1_ = []
for file_ in dataset:
    result1 = dataset['Content'].apply(lambda x: [lemmatizer.lemmatize(y) for y in x])
    list1_.append(result1)
dataset = pd.concat(list1_, ignore_index=True)

Expected output:

>> lemmatizer.lemmatize('cats')
>> [cat]

Current output:

>> lemmatizer.lemmatize('cats')
>> [c,a,t,s]

Best answer

TL;DR

result1 = dataset['Content'].apply(lambda x: [lemmatizer.lemmatize(y) for y in x.split()])

The lemmatizer accepts any string as input and treats it as a single token.

If the dataset['Content'] column holds strings, iterating over a string iterates over its characters rather than its "words", e.g.

>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> x = 'this is a foo bar sentence, that is of type str'
>>> [wnl.lemmatize(ch) for ch in x]
['t', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 'f', 'o', 'o', ' ', 'b', 'a', 'r', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', ',', ' ', 't', 'h', 'a', 't', ' ', 'i', 's', ' ', 'o', 'f', ' ', 't', 'y', 'p', 'e', ' ', 's', 't', 'r']

So you have to word-tokenize the sentence string first, e.g.:

>>> from nltk import word_tokenize
>>> [wnl.lemmatize(word) for word in x.split()]
['this', 'is', 'a', 'foo', 'bar', 'sentence,', 'that', 'is', 'of', 'type', 'str']
>>> [wnl.lemmatize(word) for word in word_tokenize(x)]
['this', 'is', 'a', 'foo', 'bar', 'sentence', ',', 'that', 'is', 'of', 'type', 'str']

Another example:

>>> from nltk import word_tokenize
>>> x = 'the geese ran through the parks'
>>> [wnl.lemmatize(word) for word in x.split()]
['the', u'goose', 'ran', 'through', 'the', u'park']
>>> [wnl.lemmatize(word) for word in word_tokenize(x)]
['the', u'goose', 'ran', 'through', 'the', u'park']
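To carry the per-sentence fix over to a whole DataFrame column, a small helper can be handy. The following is only a sketch: `lemmatize_text` and the `Lemmas` column name are hypothetical, and the helper takes the lemmatizing function as an argument so that it can be tried without NLTK. It also returns an empty list for non-string cells, on the assumption that empty CSV cells may be loaded as NaN.

```python
def lemmatize_text(text, lemmatize, tokenize=str.split):
    """Split one text value into tokens, then lemmatize each token.

    `lemmatize` is any callable mapping a word to its lemma, for example
    WordNetLemmatizer().lemmatize. `tokenize` defaults to whitespace
    splitting; nltk.word_tokenize can be passed instead.
    """
    # Non-string cells (such as NaN for an empty CSV field) produce no tokens.
    if not isinstance(text, str):
        return []
    return [lemmatize(token) for token in tokenize(text)]


# Assumed usage with the objects from the question (requires pandas, NLTK
# and the WordNet data; 'Lemmas' is a hypothetical column name):
#
#   dataset['Lemmas'] = dataset['Content'].apply(
#       lambda text: lemmatize_text(text, lemmatizer.lemmatize))
```

Passing the tokenizer as a parameter keeps the choice between `x.split()` and `word_tokenize(x)` in one place, which matters for punctuation as the first example shows.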

However, for more accurate lemmatization you should word-tokenize and POS-tag the sentence first; see https://github.com/alvations/earthy/blob/master/FAQ.md#how-to-use-default-nltk-functions-in-earthy
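As a rough illustration of POS-aware lemmatization with plain NLTK (not the earthy wrapper linked above), the sketch below maps Penn Treebank tags to the single-letter POS codes that `WordNetLemmatizer.lemmatize` accepts through its `pos` argument. The helper names `penn_to_wordnet` and `lemmatize_sentence` are hypothetical, and the code assumes the NLTK tokenizer, tagger and WordNet data have already been downloaded.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code; fall back to noun."""
    if tag.startswith('JJ'):
        return 'a'  # adjective
    if tag.startswith('VB'):
        return 'v'  # verb
    if tag.startswith('RB'):
        return 'r'  # adverb
    return 'n'      # noun, which is also the lemmatizer's default


def lemmatize_sentence(sentence):
    """Tokenize, POS-tag and lemmatize one sentence string."""
    # Imported here so the tag mapping above can be used without NLTK.
    from nltk import pos_tag, word_tokenize
    from nltk.stem import WordNetLemmatizer

    wnl = WordNetLemmatizer()
    tagged = pos_tag(word_tokenize(sentence))
    return [wnl.lemmatize(word, pos=penn_to_wordnet(tag)) for word, tag in tagged]
```

With a verb tag supplied, the lemmatizer can look up verb forms such as 'ran', which the noun default left unchanged in the example above.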

This post (python - WordNetLemmatizer error - all letters are lemmatized) is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/44752571/
