
python - nltk frequency combining singular and plural, verbs and adverbs when tokenizing


I want to count frequencies, but I want to combine the singular and plural forms of nouns, as well as verbs with their adverb forms. Please excuse the awkward sentence. For example: "That aggressive person walk by the house over there, one of many houses aggressively"

Tokenize and count frequencies:

import nltk
from nltk.tokenize import RegexpTokenizer
test = "That aggressive person walk by the house over there, one of many houses aggressively"
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(test)
fdist = nltk.FreqDist(tokens)
common=fdist.most_common(100)

Output: [('house', 1), ('aggressive', 1), ('by', 1), ('That', 1), ('houses', 1), ('over', 1), ('there', 1), ('walk', 1), ('person', 1), ('many', 1), ('of', 1), ('aggressively', 1), ('one', 1), ('the', 1)]

I would like house and houses to be counted as ('house\houses', 2), and aggressive and aggressively to be counted as ('aggressive\aggressively', 2). Is this possible? If not, how can I make it look like that?

Best Answer

You need lemmatization.

NLTK includes a lemmatizer based on WordNet:

import nltk
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
lemmatizer = nltk.stem.WordNetLemmatizer()  # requires the 'wordnet' corpus: nltk.download('wordnet')
test = "That aggressive person walk by the house over there, one of many houses aggressively"
tokens = tokenizer.tokenize(test)
lemmas = [lemmatizer.lemmatize(t) for t in tokens]  # reduce each token to its base form, e.g. houses -> house
fdist = nltk.FreqDist(lemmas)
common = fdist.most_common(100)

This results in:

[('house', 2),
('aggressively', 1),
('by', 1),
('That', 1),
('over', 1),
('there', 1),
('walk', 1),
('person', 1),
('many', 1),
('of', 1),
('aggressive', 1),
('one', 1),
('the', 1)]
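
Note that by default the WordNet lemmatizer treats every token as a noun, so verb forms such as "walked" or "walks" would not be reduced to "walk". A common workaround is to POS-tag the tokens first and pass the tag along. This is only a minimal sketch, assuming the 'averaged_perceptron_tagger' NLTK data is installed; the wordnet_pos helper is an illustrative name, not part of NLTK:

from nltk.corpus import wordnet

def wordnet_pos(treebank_tag):
    # map Penn Treebank tags to the POS constants the WordNet lemmatizer expects
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # default, same as calling lemmatize() with no pos argument

tagged = nltk.pos_tag(tokens)  # e.g. [('That', 'DT'), ('aggressive', 'JJ'), ...]
lemmas = [lemmatizer.lemmatize(word, wordnet_pos(tag)) for word, tag in tagged]
nltk.FreqDist(lemmas).most_common()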

However, the WordNet lemmatizer will not merge aggressive and aggressively. There are other lemmatizers that may do what you want. First, though, you might want to consider stemming:

stemmer = nltk.stem.PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]  # crude suffix stripping; stems are not necessarily real words
nltk.FreqDist(stems).most_common()

This gives you:

[(u'aggress', 2),
(u'hous', 2),
(u'there', 1),
(u'That', 1),
(u'of', 1),
(u'over', 1),
(u'walk', 1),
(u'person', 1),
(u'mani', 1),
(u'the', 1),
(u'one', 1),
(u'by', 1)]

Now the counts look right! However, you may be annoyed that the stems don't necessarily look like real words...
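
If you also want display keys like the question asked for, such as ('house/houses', 2), one workaround (not from the original answer, just a sketch building on the stemming code above) is to remember which surface forms produced each stem and join them for display:

from collections import defaultdict

# group the original tokens by their stem so each count can be labelled
# with the real words that produced it, e.g. 'house/houses'
surface_forms = defaultdict(set)
for tok in tokens:
    surface_forms[stemmer.stem(tok)].add(tok)

stem_counts = nltk.FreqDist(stemmer.stem(tok) for tok in tokens)
readable = [('/'.join(sorted(surface_forms[stem])), count)
            for stem, count in stem_counts.most_common()]
print(readable)  # e.g. [('house/houses', 2), ('aggressive/aggressively', 2), ...]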

For python - nltk frequency combining singular and plural, verbs and adverbs when tokenizing, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/31847904/
