
python - How to tweak the NLTK sentence tokenizer

Reposted. Author: IT老高. Updated: 2023-10-28 21:52:54

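I'm using NLTK to analyze a few classic texts, and I'm running into trouble tokenizing the text sentence by sentence. For example, here is a snippet from Moby Dick: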

import nltk
sent_tokenize = nltk.data.load('tokenizers/punkt/english.pickle')

'''
(Chapter 16)
A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but
that's a rather cold and clammy reception in the winter time, ain't it, Mrs. Hussey?"
'''
sample = 'A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs. Hussey?"'

print("\n-----\n".join(sent_tokenize.tokenize(sample)))
'''
OUTPUT
"A clam for supper?
-----
a cold clam; is THAT what you mean, Mrs.
-----
Hussey?
-----
" says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs.
-----
Hussey?
-----
"
'''

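Given that Melville's syntax is a bit dated, I don't expect perfection, but NLTK ought to be able to handle terminal double quotes and titles like "Mrs." Since the tokenizer is the result of an unsupervised training algorithm, however, I can't figure out how to tinker with it.

Can anyone recommend a better sentence tokenizer? I'd prefer a simple heuristic that I can hack rather than having to train my own parser.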

Best answer

You need to supply a list of abbreviations to the tokenizer, like so:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
# Abbreviations are listed in lowercase and without the trailing period.
punkt_param = PunktParameters()
punkt_param.abbrev_types = {'dr', 'vs', 'mr', 'mrs', 'prof', 'inc'}
sentence_splitter = PunktSentenceTokenizer(punkt_param)
text = "is THAT what you mean, Mrs. Hussey?"
sentences = sentence_splitter.tokenize(text)

sentences is now:

['is THAT what you mean, Mrs. Hussey?']
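
As far as I can tell from the NLTK source, Punkt compares each period-final token, lowercased and with its final period removed, against abbrev_types, so entries written as 'Mrs.' or 'MRS' would not match. Below is a minimal sketch of a helper that normalizes a hand-written list before building the splitter. The names normalize_abbreviations and build_splitter are hypothetical, and NLTK is imported inside the function only so that the normalization step can be used on its own:

```python
def normalize_abbreviations(raw_abbreviations):
    """Return abbreviations lowercased, trimmed, and without a trailing period."""
    normalized = set()
    for abbreviation in raw_abbreviations:
        cleaned = abbreviation.strip().lower()
        # Strip only the final period so that 'U.S.' becomes 'u.s'.
        if cleaned.endswith('.'):
            cleaned = cleaned[:-1]
        if cleaned:
            normalized.add(cleaned)
    return normalized


def build_splitter(raw_abbreviations):
    """Build an untrained Punkt splitter that knows only the given abbreviations."""
    # Imported here so normalize_abbreviations works without NLTK installed.
    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

    params = PunktParameters()
    params.abbrev_types = normalize_abbreviations(raw_abbreviations)
    return PunktSentenceTokenizer(params)
```

With this, build_splitter(['Mrs.', 'Dr.', 'Prof.']).tokenize(text) should keep the example sentence in one piece, just as the hand-written set does.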

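Update: This does not work if the last word of the sentence has an apostrophe or a quotation mark attached to it (like Hussey?'). A quick-and-dirty workaround is to insert a space before any apostrophe or quotation mark that follows a sentence-ending symbol (.!?):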

text = text.replace('?"', '? "').replace('!"', '! "').replace('."', '. "')
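
The three chained replace calls cover only the double quote. Here is a hedged sketch of the same workaround as a single regular expression that also covers the apostrophe case. The function name pad_closing_quotes is made up for illustration. Note that it changes the text, so the extra spaces will show up in the tokenized sentences:

```python
import re

# A sentence-ending mark (. ! ?) immediately followed by a quote or apostrophe.
_END_THEN_QUOTE = re.compile(r'([.!?])(["\'])')


def pad_closing_quotes(text):
    """Insert a space between a sentence-ending mark and the quote that follows it."""
    return _END_THEN_QUOTE.sub(r'\1 \2', text)
```

Apostrophes inside words, as in ain't, are left alone because no sentence-ending mark precedes them.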

This post is about python - How to tweak the NLTK sentence tokenizer. A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/14095971/
