
nlp - Adding a SpaCy Tokenizer exception: Do not split '>>'


I am trying to add an exception so that '>>' and '>> ' are recognized as indicators of the start of a new sentence. For example,

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'>> We should. >>No.')

for sent in doc.sents:
    print(sent)

It prints:
>> We should.
>
>
No.

However, I would like it to print:
>> We should.
>> No.
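
The tokenizer splits '>>' into two separate '>' tokens, which is what the sentence segmentation then breaks on; this can be confirmed by inspecting the tokens (a minimal check, using the same en_core_web_sm model as above):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'>> We should. >>No.')
# '>>' is typically split into two '>' tokens by the default tokenizer:
print([token.text for token in doc])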

Thanks in advance for your time!

Best Answer

You need to create a custom component. The spaCy code examples include a custom sentence segmentation example. From the documentation, that example does the following:

Example of adding a pipeline component to prohibit sentence boundaries before certain tokens.



Code (the example adapted to your needs):
import spacy


def prevent_sentence_boundaries(doc):
    # Veto a sentence start on every token that may not begin a sentence.
    for token in doc:
        if not can_be_sentence_start(token):
            token.is_sent_start = False
    return doc


def can_be_sentence_start(token):
    # A token immediately preceded by '>' must not open a new sentence,
    # so the tokens '>', '>', 'No' stay in the same sentence.
    if token.i > 0 and token.nbor(-1).text == '>':
        return False
    return True


nlp = spacy.load('en_core_web_sm')
# spaCy 2.x API: the component function runs before the parser, which
# then respects the pre-set sentence boundaries.
nlp.add_pipe(prevent_sentence_boundaries, before='parser')

raw_text = u'>> We should. >> No.'
doc = nlp(raw_text)
sentences = [sent.text.strip() for sent in doc.sents]  # sent.text instead of the deprecated sent.string
for sentence in sentences:
    print(sentence)

Output
>> We should.
>> No.
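
Note that passing the component function directly to add_pipe is the spaCy 2.x API. On spaCy 3.x, components must be registered by name first; a minimal sketch of the same fix under the 3.x API (assuming en_core_web_sm is installed):

import spacy
from spacy.language import Language


@Language.component('prevent_sentence_boundaries')
def prevent_sentence_boundaries(doc):
    # Forbid a sentence boundary on any token that follows a '>'.
    for token in doc:
        if token.i > 0 and token.nbor(-1).text == '>':
            token.is_sent_start = False
    return doc


nlp = spacy.load('en_core_web_sm')
# In spaCy 3.x, pipes are added by their registered string name.
nlp.add_pipe('prevent_sentence_boundaries', before='parser')

doc = nlp('>> We should. >> No.')
for sent in doc.sents:
    print(sent.text.strip())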

Regarding nlp - Adding a SpaCy Tokenizer exception: Do not split '>>', we found a similar question on Stack Overflow: https://stackoverflow.com/questions/52236776/
