
python - Why does pos_tag in NLTK tag "please" as NN?

Reposted. Author: 行者123. Updated: 2023-12-01 04:05:03

I have a serious problem: I downloaded the latest version of NLTK and I am getting strange POS output:

import nltk
import re

sample_text = "start please with me"
tokenized = nltk.sent_tokenize(sample_text)

# Chunk grammar: group runs of VB or VBZ tokens into one chunk.
chunkGram = r"""Chank___Start:{<VB|VBZ>*} """
chunkParser = nltk.RegexpParser(chunkGram)

for i in tokenized:
    words = nltk.word_tokenize(i)
    tagged = nltk.pos_tag(words)
    chunked = chunkParser.parse(tagged)
    print(chunked)

[Output]:

(S start/JJ please/NN with/IN me/PRP)

I don't know why "start" is tagged as JJ and "please" as NN.

Best answer

The default NLTK pos_tag has somehow learned that please is a noun. In proper English that is almost never correct. For example:

>>> from nltk import pos_tag
>>> pos_tag('Please go away !'.split())
[('Please', 'NNP'), ('go', 'VB'), ('away', 'RB'), ('!', '.')]
>>> pos_tag('Please'.split())
[('Please', 'VB')]
>>> pos_tag('please'.split())
[('please', 'NN')]
>>> pos_tag('please !'.split())
[('please', 'NN'), ('!', '.')]
>>> pos_tag('Please !'.split())
[('Please', 'NN'), ('!', '.')]
>>> pos_tag('Would you please go away ?'.split())
[('Would', 'MD'), ('you', 'PRP'), ('please', 'VB'), ('go', 'VB'), ('away', 'RB'), ('?', '.')]
>>> pos_tag('Would you please go away !'.split())
[('Would', 'MD'), ('you', 'PRP'), ('please', 'VB'), ('go', 'VB'), ('away', 'RB'), ('!', '.')]
>>> pos_tag('Please go away ?'.split())
[('Please', 'NNP'), ('go', 'VB'), ('away', 'RB'), ('?', '.')]

Taking WordNet as the benchmark, there should be no case where please is a noun.

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('please')
[Synset('please.v.01'), Synset('please.v.02'), Synset('please.v.03'), Synset('please.r.01')]
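To make that WordNet check explicit, here is a minimal sketch that reads the part of speech out of each synset name. It assumes only the `<lemma>.<pos>.<sense>` naming shown in the output above; the helper name `pos_counts` is mine, not part of NLTK, and the list of names is copied by hand so the snippet runs without the WordNet corpus.

```python
from collections import Counter

# Synset names copied from the wn.synsets('please') output above.
SYNSET_NAMES = ["please.v.01", "please.v.02", "please.v.03", "please.r.01"]

# WordNet's one-letter part-of-speech codes.
POS_NAMES = {
    "n": "noun",
    "v": "verb",
    "a": "adjective",
    "s": "adjective satellite",
    "r": "adverb",
}


def pos_counts(synset_names):
    """Count synsets per part of speech (hypothetical helper, not NLTK API)."""
    counts = Counter()
    for name in synset_names:
        # Split from the right so a lemma containing dots does not break parsing.
        _lemma, pos, _sense = name.rsplit(".", 2)
        counts[POS_NAMES[pos]] += 1
    return dict(counts)


print(pos_counts(SYNSET_NAMES))  # {'verb': 3, 'adverb': 1}
```

Three verb senses, one adverb sense, and no noun sense, which is what the argument above relies on.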

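But I think this is largely caused by the text used to train the PerceptronTagger, rather than by the implementation of the tagger itself.

Now let's look at what is inside the pretrained PerceptronTagger. We find that its tag dictionary only knows a little over 1,500 words: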

>>> from nltk import PerceptronTagger
>>> tagger = PerceptronTagger()
>>> tagger.tagdict['I']
'PRP'
>>> tagger.tagdict['You']
'PRP'
>>> tagger.tagdict['start']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'start'
>>> tagger.tagdict['Start']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'Start'
>>> tagger.tagdict['please']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'please'
>>> tagger.tagdict['Please']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'Please'
>>> len(tagger.tagdict)
1549
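Those KeyErrors matter because, as far as I can tell from the NLTK source, the tagger consults tagdict first and only asks the statistical model when the word is missing. The sketch below imitates that lookup order in plain Python. It is not NLTK's code: `tag_with_dict_first` and `toy_fallback` are hypothetical names, and the fallback is a deliberately dumb stand-in for the trained model.

```python
def tag_with_dict_first(words, tagdict, fallback):
    """Tag each word from tagdict when present, otherwise ask the fallback."""
    tagged = []
    prev_tag = "-START-"
    for word in words:
        tag = tagdict.get(word)
        if tag is None:
            # Unknown word: the decision is left to the model, which only
            # sees context features and may well guess wrong.
            tag = fallback(word, prev_tag)
        tagged.append((word, tag))
        prev_tag = tag
    return tagged


def toy_fallback(word, prev_tag):
    """Hypothetical stand-in for the trained model: always guesses noun."""
    return "NN"


# Two entries that the real tagdict does contain, per the session above.
tagdict = {"I": "PRP", "You": "PRP"}
print(tag_with_dict_first("I please You".split(), tagdict, toy_fallback))
# [('I', 'PRP'), ('please', 'NN'), ('You', 'PRP')]
```

Words inside the dictionary get a fixed tag; everything else, including "start" and "please", depends entirely on what the model learned from its training text.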

One trick you can use is to hack the tagger:

>>> tagger.tagdict['start'] = 'VB'
>>> tagger.tagdict['please'] = 'VB'
>>> tagger.tag('please start with me'.split())
[('please', 'VB'), ('start', 'VB'), ('with', 'IN'), ('me', 'PRP')]
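If you would rather not mutate the tagger object, a similar effect can be had by correcting the tagger's output afterwards. This is a minimal sketch under my own assumptions: `apply_overrides` and `OVERRIDES` are hypothetical names, and the input list is the pos_tag output from the question, copied by hand so the snippet runs without NLTK.

```python
def apply_overrides(tagged, overrides):
    """Return a copy of tagged where words listed in overrides get a fixed tag."""
    return [(word, overrides.get(word.lower(), tag)) for word, tag in tagged]


# pos_tag output for the question's sentence, copied from the question.
tagged = [("start", "JJ"), ("please", "NN"), ("with", "IN"), ("me", "PRP")]

# Hypothetical per-word corrections; keys are lowercase.
OVERRIDES = {"start": "VB", "please": "VB"}

print(apply_overrides(tagged, OVERRIDES))
# [('start', 'VB'), ('please', 'VB'), ('with', 'IN'), ('me', 'PRP')]
```

Like the tagdict hack, this ignores context: "a fresh start" would also come out as VB, so it only suits words that are almost always one part of speech in your data.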

But the most logical thing to do is simply to retrain the tagger; see http://www.nltk.org/_modules/nltk/tag/perceptron.html#PerceptronTagger.train
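A rough sketch of what retraining might look like follows. It assumes `PerceptronTagger(load=False)` and `train(sentences, nr_iter=...)` behave as in the NLTK releases I have seen, so check the signature in your installed version. The three training sentences are hand-tagged toy data for illustration only, and `build_training_data` and `retrain` are hypothetical helper names.

```python
def build_training_data():
    """Return a tiny hand-tagged corpus: a list of sentences of (word, tag) pairs."""
    return [
        [("please", "VB"), ("start", "VB"), ("with", "IN"), ("me", "PRP")],
        [("start", "VB"), ("please", "VB"), ("with", "IN"), ("me", "PRP")],
        [("please", "VB"), ("go", "VB"), ("away", "RB"), ("!", ".")],
    ]


def retrain(sentences, nr_iter=5):
    """Train a fresh PerceptronTagger on sentences and return it."""
    # Imported lazily so the data helper above stays usable without NLTK.
    from nltk.tag.perceptron import PerceptronTagger

    # load=False starts from an empty model instead of the pretrained one.
    tagger = PerceptronTagger(load=False)
    tagger.train(sentences, nr_iter=nr_iter)
    return tagger
```

With NLTK installed, `retrain(build_training_data()).tag('please start with me'.split())` exercises the new model. Three sentences are far too few for a usable tagger; in practice you would train on a large tagged corpus that contains the constructions you care about.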


If you don't want to retrain the tagger, see Python NLTK pos_tag not returning the correct part-of-speech tag

Most likely, using the StanfordPOSTagger will get you what you need:

>>> from nltk import StanfordPOSTagger
>>> sjar = '/home/alvas/stanford-postagger/stanford-postagger.jar'
>>> m = '/home/alvas/stanford-postagger/models/english-left3words-distsim.tagger'
>>> spos_tag = StanfordPOSTagger(m, sjar)
>>> spos_tag.tag('Please go away !'.split())
[(u'Please', u'VB'), (u'go', u'VB'), (u'away', u'RB'), (u'!', u'.')]
>>> spos_tag.tag('Please'.split())
[(u'Please', u'VB')]
>>> spos_tag.tag('Please !'.split())
[(u'Please', u'VB'), (u'!', u'.')]
>>> spos_tag.tag('please !'.split())
[(u'please', u'VB'), (u'!', u'.')]
>>> spos_tag.tag('please'.split())
[(u'please', u'VB')]
>>> spos_tag.tag('Would you please go away !'.split())
[(u'Would', u'MD'), (u'you', u'PRP'), (u'please', u'VB'), (u'go', u'VB'), (u'away', u'RB'), (u'!', u'.')]
>>> spos_tag.tag('Would you please go away ?'.split())
[(u'Would', u'MD'), (u'you', u'PRP'), (u'please', u'VB'), (u'go', u'VB'), (u'away', u'RB'), (u'?', u'.')]

For Linux: see https://gist.github.com/alvations/e1df0ba227e542955a8a

For Windows: see https://gist.github.com/alvations/0ed8641d7d2e1941b9f9

Regarding "python - Why does pos_tag in NLTK tag "please" as NN?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/35737099/
