
python - How to improve NLTK sentence segmentation?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 13:31:05

Given this paragraph from Wikipedia:

An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952. Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law. It was during the tenure of F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action.

I run NLTK's nltk.sent_tokenize to get the sentences. This returns:

['An ambitious campus expansion plan was proposed by Fr.', 
'Vernon F. Gallagher in 1952.',
'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.',
'It was during the tenure of Fr.',
"Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."
]

While NLTK handles F. Henry J. McAnulty as a single unit, it fails on Fr. Vernon F. Gallagher, and that breaks the sentence in two.

The correct tokenization should be:

[
'An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952.',
'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.',
"It was during the tenure of Fr. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."
]

How can I improve the tokenizer's performance?

Best answer

The nice thing about the Kiss and Strunk (2006) Punkt algorithm is that it is unsupervised. So, given a new text, you should retrain the model and apply it to your text, e.g.

>>> from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
>>> text = "An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952. Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law. It was during the tenure of F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."

# Training a new model with the text.
>>> tokenizer = PunktSentenceTokenizer()
>>> tokenizer.train(text)
<nltk.tokenize.punkt.PunktParameters object at 0x106c5d828>

# It automatically learns the abbreviations.
>>> tokenizer._params.abbrev_types
{'f', 'fr', 'j'}

# Use the customized tokenizer.
>>> tokenizer.tokenize(text)
['An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952.', 'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.', "It was during the tenure of F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."]
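As the transcript shows, train() returns a PunktParameters object. Whether the tokenizer instance is also updated in place may depend on the NLTK version, so a more explicit variant is to train with PunktTrainer and hand the resulting parameters to a fresh tokenizer. The following is a minimal sketch under that assumption; the variable names are illustrative, and the abbreviations learned from such a short text may differ from those shown above.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

text = (
    "An ambitious campus expansion plan was proposed by Fr. Vernon F. "
    "Gallagher in 1952. Assumption Hall, the first student dormitory, was "
    "opened in 1954, and Rockwell Hall was dedicated in November 1958, "
    "housing the schools of business and law. It was during the tenure of "
    "F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to "
    "action."
)

# Learn abbreviations, collocations and sentence starters from the text.
trainer = PunktTrainer()
trainer.train(text, finalize=True)
params = trainer.get_params()

# Build the tokenizer from the learned parameters explicitly,
# rather than relying on train() mutating an existing instance.
tokenizer = PunktSentenceTokenizer(params)
sentences = tokenizer.tokenize(text)

print(sorted(params.abbrev_types))
for sentence in sentences:
    print(sentence)
```

Printing params.abbrev_types before tokenizing is a quick way to check whether the training text was large enough for Punkt to pick up the abbreviations you care about.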

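If there is not enough data to generate good statistics when retraining the model, you can also supply a predetermined list of abbreviations before training; see "How to avoid NLTK's sentence tokenizer splitting on abbreviations?"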

>>> from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

>>> punkt_param = PunktParameters()
>>> abbreviation = ['f', 'fr', 'k']
>>> punkt_param.abbrev_types = set(abbreviation)

>>> tokenizer = PunktSentenceTokenizer(punkt_param)
>>> tokenizer.train(text)
<nltk.tokenize.punkt.PunktParameters object at 0x106c5d828>

>>> tokenizer.tokenize(text)
['An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952.', 'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.', "It was during the tenure of F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."]
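When the text is too short to train on at all, one option is to skip training and rely only on a hand-written abbreviation set. The sketch below assumes that abbreviations are stored lowercase and without the trailing period, as in the examples above; the set {"fr", "f", "j"} and the variable names are illustrative choices, not part of the original answer.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

text = (
    "An ambitious campus expansion plan was proposed by Fr. Vernon F. "
    "Gallagher in 1952. Assumption Hall, the first student dormitory, was "
    "opened in 1954, and Rockwell Hall was dedicated in November 1958, "
    "housing the schools of business and law. It was during the tenure of "
    "F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to "
    "action."
)

# Hand-written abbreviations: lowercase, no trailing period.
params = PunktParameters()
params.abbrev_types = {"fr", "f", "j"}

# No training step: the tokenizer uses only the abbreviations given above.
tokenizer = PunktSentenceTokenizer(params)
sentences = tokenizer.tokenize(text)

for sentence in sentences:
    print(sentence)
```

This trades the adaptivity of an unsupervised model for predictability: the tokenizer will only know about abbreviations you list yourself.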

Regarding "python - How to improve NLTK sentence segmentation?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47274540/
