
python - Tokenizer expanding contractions

Reposted. Author: 行者123. Updated: 2023-12-01 07:17:00

I am looking for a tokenizer that expands contractions.

When nltk splits a phrase into tokens, contractions are not expanded.

nltk.word_tokenize("she's")
-> ['she', "'s"]

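However, a dictionary that only maps contractions to expansions ignores the information provided by the surrounding words, so it cannot decide whether "she's" should map to "she is" or "she has".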
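To make the limitation concrete, here is a minimal sketch of such a dictionary lookup. The names `CONTRACTIONS` and `expand_with_dict` are made up for illustration and do not come from nltk or any other library.

```python
# Hypothetical context-free mapping; each contraction gets exactly one expansion.
CONTRACTIONS = {
    "she's": "she is",  # could equally be "she has"
    "he's": "he is",
    "can't": "cannot",
}


def expand_with_dict(text):
    """Expand contractions word by word, ignoring the surrounding words."""
    return " ".join(CONTRACTIONS.get(word.lower(), word) for word in text.split())


print(expand_with_dict("she's a software engineer"))  # she is a software engineer
print(expand_with_dict("she's got a cat"))            # she is got a cat (wrong)
```

The second call shows the problem: without looking at "got", the lookup has no way to choose "has".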

Is there a tokenizer that provides contraction expansion?

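Best answer

You can use rule-based matching in spaCy to take the information provided by the surrounding words into account. I wrote some demo code below that you can extend to cover more cases: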

import spacy
from spacy.matcher import Matcher

sentences = ["now she's a software engineer", "she's got a cat", "he's a tennis player", "He thinks that she's 30 years old"]

nlp = spacy.load('en_core_web_sm')

# Build the matcher once. This is the spaCy v3 call form;
# spaCy v2 used matcher.add(name, None, pattern) instead.
matcher = Matcher(nlp.vocab)
matcher.add("case_has", [
    [{"POS": "PRON"}, {"LOWER": "'s"}, {"LOWER": "got"}],
    [{"POS": "PRON"}, {"LOWER": "'s"}, {"LOWER": "been"}],
])
matcher.add("case_is", [
    [{"POS": "PRON"}, {"LOWER": "'s"}, {"POS": "DET"}],
    [{"POS": "PRON"}, {"LOWER": "'s"}, {"IS_DIGIT": True}],
])
# .. add more cases

def normalize(sentence):
    doc = nlp(sentence)
    #print([(t.text, t.pos_, t.dep_) for t in doc])
    # Map the index of every matched "'s" token to its expansion.
    expansions = {}
    for match_id, start, end in matcher(doc):
        word = "has" if nlp.vocab.strings[match_id] == "case_has" else "is"
        for t in doc[start:end]:
            if t.lower_ == "'s":
                expansions[t.i] = word
    # Emit every token exactly once, even when several patterns match.
    return ' '.join(expansions.get(t.i, t.text) for t in doc)

for s in sentences:
    print(s)
    print(normalize(s))
    print()

Output:

now she's a software engineer
now she is a software engineer

she's got a cat
she has got a cat

he's a tennis player
he is a tennis player

He thinks that she's 30 years old
He thinks that she is 30 years old
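The same idea, deciding from the word that follows, can also be sketched without spaCy on a token list in the form nltk.word_tokenize returns (['she', "'s", ...]). This is only an illustration under stated assumptions: `HAS_CUES`, `PRONOUNS` and `expand_s` are hypothetical names, and the cue list is hand-picked and far from complete, so anything not listed falls back to "is".

```python
# Hypothetical cue words that signal "has" after a pronoun + "'s".
HAS_CUES = {"got", "been", "had"}
PRONOUNS = {"he", "she", "it"}


def expand_s(tokens):
    """Replace "'s" after a pronoun with "has" or "is", based on the next token."""
    out = []
    for i, tok in enumerate(tokens):
        prev_tok = tokens[i - 1].lower() if i > 0 else ""
        if tok == "'s" and prev_tok in PRONOUNS:
            next_tok = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
            out.append("has" if next_tok in HAS_CUES else "is")
        else:
            # Everything else, including a possessive "'s" after a noun, is kept.
            out.append(tok)
    return out


print(expand_s(["she", "'s", "got", "a", "cat"]))
print(expand_s(["he", "'s", "a", "tennis", "player"]))
```

Unlike the spaCy patterns, this sketch has no part-of-speech information, so it only covers the cue words listed and cannot tell apart genuinely ambiguous cases.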

Regarding python - Tokenizer expanding contractions, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57905168/
