gpt4 book ai didi

python - 如何在 NLTK 中使用 pos_tag?

转载 作者:太空狗 更新时间:2023-10-30 02:25:51 29 4
gpt4 key购买 nike

所以我试图在列表中标记一堆单词(确切地说是词性标记),如下所示:

pos = [nltk.pos_tag(i,tagset='universal') for i in lw]

其中 lw 是单词列表(它真的很长,否则我会发布它,但它就像 [['hello'],['world']] (又名列表列表,每个列表包含一个单词)但是当我尝试运行它时,我得到:

Traceback (most recent call last):
File "<pyshell#183>", line 1, in <module>
pos = [nltk.pos_tag(i,tagset='universal') for i in lw]
File "<pyshell#183>", line 1, in <listcomp>
pos = [nltk.pos_tag(i,tagset='universal') for i in lw]
File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\__init__.py", line 134, in pos_tag
return _pos_tag(tokens, tagset, tagger)
File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\__init__.py", line 102, in _pos_tag
tagged_tokens = tagger.tag(tokens)
File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\perceptron.py", line 152, in tag
context = self.START + [self.normalize(w) for w in tokens] + self.END
File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\perceptron.py", line 152, in <listcomp>
context = self.START + [self.normalize(w) for w in tokens] + self.END
File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\perceptron.py", line 240, in normalize
elif word[0].isdigit():
IndexError: string index out of range

谁能告诉我为什么会出现这个错误,我是如何出现这个错误的,以及如何解决这个错误?非常感谢。

最佳答案

首先,使用人类可读的变量名,它有帮助 =)

接下来,pos_tag 输入的是一个字符串列表。原来如此

>>> from nltk import pos_tag
>>> sentences = [ ['hello', 'world'], ['good', 'morning'] ]
>>> [pos_tag(sent) for sent in sentences]
[[('hello', 'NN'), ('world', 'NN')], [('good', 'JJ'), ('morning', 'NN')]]

此外,如果您将输入作为原始字符串,则可以在 pos_tag 之前使用 word_tokenize:

>>> from nltk import pos_tag, word_tokenize
>>> a_sentence = 'hello world'
>>> word_tokenize(a_sentence)
['hello', 'world']
>>> pos_tag(word_tokenize(a_sentence))
[('hello', 'NN'), ('world', 'NN')]

>>> two_sentences = ['hello world', 'good morning']
>>> [word_tokenize(sent) for sent in two_sentences]
[['hello', 'world'], ['good', 'morning']]
>>> [pos_tag(word_tokenize(sent)) for sent in two_sentences]
[[('hello', 'NN'), ('world', 'NN')], [('good', 'JJ'), ('morning', 'NN')]]

如果段落中有句子,可以使用 sent_tokenize 将句子拆分。

>>> from nltk import sent_tokenize, word_tokenize, pos_tag
>>> text = "Hello world. Good morning."
>>> sent_tokenize(text)
['Hello world.', 'Good morning.']
>>> [word_tokenize(sent) for sent in sent_tokenize(text)]
[['Hello', 'world', '.'], ['Good', 'morning', '.']]
>>> [pos_tag(word_tokenize(sent)) for sent in sent_tokenize(text)]
[[('Hello', 'NNP'), ('world', 'NN'), ('.', '.')], [('Good', 'JJ'), ('morning', 'NN'), ('.', '.')]]

另请参阅:How to do POS tagging using the NLTK POS tagger in Python?

关于python - 如何在 NLTK 中使用 pos_tag?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/47519987/

29 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com