
python - Consequences of misusing nltk's word_tokenize(sent)


I'm trying to split a paragraph into words. I have the lovely nltk.tokenize.word_tokenize(sent) at hand, but help(word_tokenize) says, "This tokenizer is designed to work on a sentence at a time."

Does anyone know what happens if you use it on a paragraph instead, i.e. up to about 5 sentences? I've tried it myself on a few short paragraphs and it seems to work, but that's hardly conclusive proof.

Best Answer

nltk.tokenize.word_tokenize(text) is simply a thin wrapper function that calls the tokenize method of a TreebankWordTokenizer instance, which apparently uses simple regular expressions to tokenize a single sentence.
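A minimal sketch of that relationship (assuming an NLTK version like the one described in this answer, where word_tokenize does not do its own sentence splitting):

from nltk.tokenize import TreebankWordTokenizer, word_tokenize

text = "Hello, world."

# word_tokenize delegates to a shared TreebankWordTokenizer instance,
# so calling the class directly should produce the same tokens.
print(word_tokenize(text))                     # ['Hello', ',', 'world', '.']
print(TreebankWordTokenizer().tokenize(text))  # ['Hello', ',', 'world', '.']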

The documentation for that class states:

This tokenizer assumes that the text has already been segmented into sentences. Any periods -- apart from those at the end of a string -- are assumed to be part of the word they are attached to (e.g. for abbreviations, etc), and are not separately tokenized.

The underlying tokenize method itself is very simple:

def tokenize(self, text):
    for regexp in self.CONTRACTIONS2:
        text = regexp.sub(r'\1 \2', text)
    for regexp in self.CONTRACTIONS3:
        text = regexp.sub(r'\1 \2 \3', text)

    # Separate most punctuation
    text = re.sub(r"([^\w\.\'\-\/,&])", r' \1 ', text)

    # Separate commas if they're followed by space.
    # (E.g., don't separate 2,500)
    text = re.sub(r"(,\s)", r' \1', text)

    # Separate single quotes if they're followed by a space.
    text = re.sub(r"('\s)", r' \1', text)

    # Separate periods that come before newline or end of string.
    text = re.sub('\. *(\n|$)', ' . ', text)

    return text.split()
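Incidentally, the CONTRACTIONS2/CONTRACTIONS3 loops at the top are what split English contractions into separate tokens; on a typical NLTK install the result looks roughly like this (exact output may vary by version):

>>> nltk.tokenize.word_tokenize("Don't panic, it can't fail.")
['Do', "n't", 'panic', ',', 'it', 'ca', "n't", 'fail', '.']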

Basically, what the method normally does is tokenize a period as a separate token when it falls at the end of the string:

>>> nltk.tokenize.word_tokenize("Hello, world.")
['Hello', ',', 'world', '.']

Any period that falls inside the string is tokenized as part of the word it is attached to, on the assumption that it is an abbreviation:

>>> nltk.tokenize.word_tokenize("Hello, world. How are you?") 
['Hello', ',', 'world.', 'How', 'are', 'you', '?']

As long as that behavior is acceptable to you, you should be fine.
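If it is not acceptable, one common workaround is to split the paragraph into sentences first with nltk.sent_tokenize and then run word_tokenize on each sentence. A minimal sketch (it assumes the 'punkt' sentence model has been downloaded):

import nltk

# nltk.download('punkt')  # sent_tokenize requires the punkt sentence model

paragraph = "Hello, world. How are you?"

# Split into sentences first, then tokenize each sentence, so that
# sentence-final periods are always separated from the preceding word.
words = [token
         for sentence in nltk.sent_tokenize(paragraph)
         for token in nltk.word_tokenize(sentence)]

print(words)  # ['Hello', ',', 'world', '.', 'How', 'are', 'you', '?']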

Regarding "python - Consequences of misusing nltk's word_tokenize(sent)", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/19373296/
