
python - Using PerceptronTagger to read my own NLTK POS-tagged dataset


I'm very new to NLTK and fairly new to Python. I'd like to use my own dataset to train and test NLTK's perceptron tagger. The training and test data have the following format (just saved in plain .txt files):

Pierre  NNP
Vinken NNP
, ,
61 CD
years NNS
old JJ
, ,
will MD
join VB
the DT
board NN
as IN
a DT
nonexecutive JJ
director NN
Nov. NNP
29 CD
. .

I want to call these functions on the data:

perceptron_tagger = nltk.tag.perceptron.PerceptronTagger(load=False)
perceptron_tagger.train(train_data)
accuracy = perceptron_tagger.evaluate(test_data)

I've tried a few things, but I just can't figure out what format the data is expected to be in. Any help would be greatly appreciated! Thanks.

Best Answer

The input to PerceptronTagger's train() and evaluate() functions is a list of lists of tuples, where each inner list is a sentence and each tuple is a (word, tag) pair of strings.
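For example, using the Penn Treebank snippet from the question, the expected structure would look roughly like this (a minimal sketch; the variable name train_data is just the one used in the question):

train_data = [
    [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'),
     ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'),
     ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'),
     ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'),
     ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
]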


Given train.txt and test.txt:

$ cat train.txt 
This foo
is foo
a foo
sentence bar
. .

That foo
is foo
another foo
sentence bar
in foo
conll bar
format bar
. .

$ cat test.txt
What foo
is foo
this foo
sentence bar
? ?

How foo
about foo
that foo
sentence bar
? ?

Read the CoNLL-formatted files into lists of tuples.

# Using https://github.com/alvations/lazyme
>>> from lazyme import per_section
>>> tagged_train_sentences = [[tuple(token.split('\t')) for token in sent] for sent in per_section(open('train.txt'))]

# Or otherwise

>>> def per_section(it, is_delimiter=lambda x: x.isspace()):
...     """
...     From http://stackoverflow.com/a/25226944/610569
...     """
...     ret = []
...     for line in it:
...         if is_delimiter(line):
...             if ret:
...                 yield ret  # OR ''.join(ret)
...                 ret = []
...         else:
...             ret.append(line.rstrip())  # OR ret.append(line)
...     if ret:
...         yield ret
...
>>>
>>> tagged_test_sentences = [[tuple(token.split('\t')) for token in sent] for sent in per_section(open('test.txt'))]
>>> tagged_test_sentences
[[('What', 'foo'), ('is', 'foo'), ('this', 'foo'), ('sentence', 'bar'), ('?', '?')], [('How', 'foo'), ('about', 'foo'), ('that', 'foo'), ('sentence', 'bar'), ('?', '?')]]

Now you can train/evaluate the tagger:

>>> from lazyme import per_section
>>> tagged_train_sentences = [[tuple(token.split('\t')) for token in sent] for sent in per_section(open('train.txt'))]
>>> from nltk.tag.perceptron import PerceptronTagger
>>> pct = PerceptronTagger(load=False)
>>> pct.train(tagged_train_sentences)
>>> pct.tag('Where do I find a foo bar sentence ?'.split())
[('Where', 'foo'), ('do', 'foo'), ('I', '.'), ('find', 'foo'), ('a', 'foo'), ('foo', 'bar'), ('bar', 'foo'), ('sentence', 'bar'), ('?', '.')]
>>> tagged_test_sentences = [[tuple(token.split('\t')) for token in sent] for sent in per_section(open('test.txt'))]
>>> pct.evaluate(tagged_test_sentences)
0.8
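If your own files separate the word and tag with spaces rather than a tab (as the sample in the question appears to), the same approach works with a whitespace split. Here is a minimal end-to-end sketch under those assumptions (blank lines separate sentences; the file names train.txt/test.txt and the helper read_tagged_file are just illustrative):

from nltk.tag.perceptron import PerceptronTagger

def read_tagged_file(path):
    """Read a word/tag-per-line file; blank lines separate sentences."""
    sentences, current = [], []
    with open(path) as fh:
        for line in fh:
            if line.strip():
                word, tag = line.split()  # splits on any whitespace (tab or spaces)
                current.append((word, tag))
            elif current:
                sentences.append(current)
                current = []
    if current:
        sentences.append(current)
    return sentences

train_data = read_tagged_file('train.txt')
test_data = read_tagged_file('test.txt')

perceptron_tagger = PerceptronTagger(load=False)
perceptron_tagger.train(train_data)
print(perceptron_tagger.evaluate(test_data))

Note that in newer NLTK releases evaluate() has been deprecated in favour of accuracy(), so depending on your version you may need to call perceptron_tagger.accuracy(test_data) instead.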

Regarding "python - Using PerceptronTagger to read my own NLTK POS-tagged dataset", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47624347/
