
python - nltk regular expression tokenizer

Reposted · Author: 太空狗 · Updated: 2023-10-29 21:30:58

I tried to implement a regular expression tokenizer with nltk in Python, but the result looks like this:

>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \w+(-\w+)* # words with optional internal hyphens
... | \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
... | \.\.\. # ellipsis
... | [][.,;"'?():-_`] # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
[('', '', ''), ('', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '')]

But the desired result is this:

['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

Why? Where did it go wrong?

Best Answer

You should turn all the capturing groups into non-capturing ones:

  • ([A-Z]\.)+ -> (?:[A-Z]\.)+
  • \w+(-\w+)* -> \w+(?:-\w+)*
  • \$?\d+(\.\d+)?%? -> \$?\d+(?:\.\d+)?%?

The problem is that regexp_tokenize appears to use re.findall, which returns a list of capture tuples when the pattern defines multiple capturing groups. See this nltk.tokenize package reference:
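The re.findall behavior can be seen directly with the standard library (a minimal illustration of why capture tuples appear, independent of nltk):

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# With capturing groups, re.findall returns only the groups' text,
# not the whole match:
print(re.findall(r'(\w+)-(\w+)', text))
# [('poster', 'print')]

# With a non-capturing group, it returns the full matches:
print(re.findall(r'\w+(?:-\w+)*', text))
# ['That', 'U', 'S', 'A', 'poster', 'print', 'costs', '12', '40']
```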

pattern (str) – The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; Use non-capturing parentheses, e.g. (?:...), instead)

Also, I am not sure you wanted :-_ in the character class, which is a range that happens to include all the uppercase letters; put the - at the end of the character class instead.

So, use:

pattern = r'''(?x)        # set flag to allow verbose regexps
  (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
| \w+(?:-\w+)*            # words with optional internal hyphens
| \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
| \.\.\.                  # ellipsis
| [][.,;"'?():_`-]        # these are separate tokens; includes ], [
'''
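Since regexp_tokenize delegates to re.findall, the fix can be verified with the standard library alone; nltk.regexp_tokenize(text, pattern) should produce the same list:

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'
pattern = r'''(?x)        # verbose mode: whitespace and comments ignored
  (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
| \w+(?:-\w+)*            # words with optional internal hyphens
| \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
| \.\.\.                  # ellipsis
| [][.,;"'?():_`-]        # separate punctuation tokens; includes ], [
'''

tokens = re.findall(pattern, text)
print(tokens)
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```

With no capturing groups left in the pattern, re.findall returns whole-match strings instead of tuples, which is exactly the token list the question asked for.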

Regarding python - nltk regular expression tokenizer, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/36353125/
