
python - Simple tokenization issue in NLTK

Reposted. Author: 行者123. Updated: 2023-11-30 23:11:49

I want to tokenize the following text:

In Düsseldorf I took my hat off. But I can't put it back on.


into:

'In', 'Düsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I',
"can't", 'put', 'it', 'back', 'on', '.'

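But to my surprise, none of the NLTK tokenizers produce this result. How can I accomplish this? Is it possible to combine these tokenizers in some way to get the output above?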

Best answer

You can take one of the tokenizers as a starting point and then fix the contractions (assuming that is the problem):

from nltk.tokenize.treebank import TreebankWordTokenizer

text = "In Düsseldorf I took my hat off. But I can't put it back on."
tokens = TreebankWordTokenizer().tokenize(text)

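contractions = ["n't", "'ll", "'m"]

# Collect the positions of contraction suffixes that were split off.
fix = []
for i in range(len(tokens)):
    for c in contractions:
        if tokens[i] == c:
            fix.append(i)

# Merge each suffix into the token before it. Every deletion shifts the
# remaining indices left by one, which fix_offset compensates for.
fix_offset = 0
for fix_id in fix:
    idx = fix_id - 1 - fix_offset
    tokens[idx] = tokens[idx] + tokens[idx+1]
    del tokens[idx+1]
    fix_offset += 1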

print(tokens)

>>> ['In', 'Düsseldorf', 'I', 'took', 'my', 'hat', 'off', '.', 'But', 'I', "can't", 'put', 'it', 'back', 'on', '.']
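
The index bookkeeping above can also be written as a single pass that builds a new list instead of deleting in place. This is a minimal sketch, not part of the original answer: `merge_contractions` and its `suffixes` parameter are hypothetical names, and the input list is a hand-written stand-in for Treebank-style tokens so that the snippet runs without NLTK installed.

```python
def merge_contractions(tokens, suffixes=("n't", "'ll", "'m")):
    """Glue each contraction suffix back onto the token before it.

    `suffixes` mirrors the `contractions` list from the answer; extend it
    (for example with "'s" or "'re") if your text needs more cases.
    """
    merged = []
    for token in tokens:
        if token in suffixes and merged:
            # Append the suffix to the previously kept token.
            merged[-1] += token
        else:
            # Ordinary token, or a suffix with nothing before it.
            merged.append(token)
    return merged


# Hand-written stand-in for tokens where "can't" was split into "ca" + "n't".
tokens = ["But", "I", "ca", "n't", "put", "it", "back", "on", "."]
result = merge_contractions(tokens)
print(result)
```

Because nothing is deleted from the list being iterated, no offset counter is needed, and a suffix that happens to be the very first token is simply left alone.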

Regarding "python - Simple tokenization issue in NLTK", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/30038793/
