
python - How do I tell spaCy to segment sentences on full stops?

Reposted. Author: 行者123. Updated: 2023-12-01 07:21:56

I have the following text:

text = 'Shop 1 942.10 984.50 1023.90 1064.80 \n\nShop 2 first 12 months 1032.70 1079.10 1122.30 1167.20 \n\nShop 2 after 12 months 1045.50 1092.60 1136.30 1181.70 \n\nShop 3 1059.40 1107.10 1151.40 1197.40 \n\nShop 4 first 3 months 1072.60 1120.90 1165.70 1212.30 \n\nShop 4 after 3 months 1082.40 1131.10 1176.40 1223.40'

I clean it up by replacing \n\n with '. ' using this code:

text = text.replace('\n\n', '. ')
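As a quick sanity check (plain Python, no spaCy needed, and not part of the original question), after the substitution the string splits cleanly into one entry per shop row:

```python
text = ('Shop 1 942.10 984.50 1023.90 1064.80 \n\n'
        'Shop 2 first 12 months 1032.70 1079.10 1122.30 1167.20 \n\n'
        'Shop 2 after 12 months 1045.50 1092.60 1136.30 1181.70 \n\n'
        'Shop 3 1059.40 1107.10 1151.40 1197.40 \n\n'
        'Shop 4 first 3 months 1072.60 1120.90 1165.70 1212.30 \n\n'
        'Shop 4 after 3 months 1082.40 1131.10 1176.40 1223.40')

# Replace the double newlines with '. ' as in the question, then split
# on '. ' -- the '.' inside numbers like 942.10 is never followed by a
# space, so only row boundaries match.
cleaned = text.replace('\n\n', '. ')
rows = cleaned.split('. ')
print(len(rows))        # 6
print(rows[0].strip())  # Shop 1 942.10 984.50 1023.90 1064.80
```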

I built a matcher with a simple generic pattern like this:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_lg', disable=['ner'])
doc = nlp(text)
matcher = Matcher(nlp.vocab)
pattern = [{'ORTH': 'Shop'}, {'LIKE_NUM': True}]
matcher.add('REV', None, pattern)

Then I use the matcher to find all the sentences in the text, which I expect to be separated by full stops:

matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)
    print(matched_span.sent, '\n')

I expect to get these results:

Shop 1
Shop 1 942.10 984.50 1023.90 1064.80 .

Shop 2
Shop 2 first 12 months 1032.70 1079.10 1122.30 1167.20 .

Shop 2
Shop 2 after 12 months 1045.50 1092.60 1136.30 1181.70 .

Shop 3
Shop 3 1059.40 1107.10 1151.40 1197.40 .

Shop 4
Shop 4 first 3 months 1072.60 1120.90 1165.70 1212.30 .

Shop 4
Shop 4 after 3 months 1082.40 1131.10 1176.40 1223.40

However, because of the way spaCy processes text, it does not split sentences on the full stop . but by some opaque rules I cannot identify, so my code returns the following results instead:

Shop 1
Shop 1 942.10

Shop 2
Shop 2 first 12 months

Shop 2
Shop 2 after 12 months 1045.50 1092.60

Shop 3
Shop 3

Shop 4
Shop 4 first 3 months

Shop 4
Shop 4 after 3 months

Is there a way to instruct/override how spaCy identifies sentences in the text based on a specific pattern (in this case, the full stop .)?

Best Answer

What you probably want to do is define a custom sentence segmenter. The default sentence segmentation algorithm spaCy uses relies on the dependency tree to figure out where sentences begin and end. You can override it by creating your own function that defines the sentence boundaries and adding it to the NLP pipeline. Following the example in spaCy's documentation:

import spacy

def custom_sentencizer(doc):
    ''' Look for sentence start tokens by scanning for periods only. '''
    for i, token in enumerate(doc[:-2]):  # The last token cannot start a sentence
        if token.text == ".":
            doc[i + 1].is_sent_start = True
        else:
            doc[i + 1].is_sent_start = False  # Tell the default sentencizer to ignore this token
    return doc

nlp = spacy.load('en_core_web_lg', disable=['ner'])
nlp.add_pipe(custom_sentencizer, before="parser") # Insert before the parser can build its own sentences
# text = ...
doc = nlp(text)

matcher = spacy.matcher.Matcher(nlp.vocab)
pattern = [{'ORTH': 'Shop'}, {'LIKE_NUM': True}]
matcher.add('REV', None, pattern)

matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)
    print(matched_span.sent, '\n')

# Shop 1
# Shop 1 942.10 984.50 1023.90 1064.80 .
#
# Shop 2
# Shop 2 first 12 months 1032.70 1079.10 1122.30 1167.20 .
#
# Shop 2
# Shop 2 after 12 months 1045.50 1092.60 1136.30 1181.70 .
#
# Shop 3
# Shop 3 1059.40 1107.10 1151.40 1197.40 .
#
# Shop 4
# Shop 4 first 3 months 1072.60 1120.90 1165.70 1212.30 .
#
# Shop 4
# Shop 4 after 3 months 1082.40 1131.10 1176.40 1223.40

Your text is quite different from natural language, so it is not surprising that spaCy performs poorly. Its internal models were trained on examples that unambiguously look like text you would read in a book or on the internet, whereas your example looks more like a machine-readable list of numbers. For instance, if the text you were working with were written more like prose, it might look like this:

Shop 1's numbers were 942.10, 984.50, 1023.90, and 1064.80. Shop 2, for the first 12 months, had numbers 1032.70, 1079.10, 1122.30, and 1167.20. Shop 2, after 12 months, had 1045.50, 1092.60, 1136.30, and 1181.70. Shop 3: 1059.40, 1107.10, 1151.40, and 1197.40. Shop 4 in the first 3 months had numbers 1072.60, 1120.90, 1165.70, and 1212.30. After 3 months, Shop 4 had 1082.40, 1131.10, 1176.40, and 1223.40.

Using that as input gives spaCy's default parser a much better chance of figuring out where the sentence breaks are, even with all the extra punctuation:

text2 = "Shop 1's numbers were 942.10, 984.50, 1023.90, and 1064.80.  Shop 2, for the first 12 months, had numbers 1032.70, 1079.10, 1122.30, and 1167.20.  Shop 2, after 12 months, had 1045.50, 1092.60, 1136.30, and 1181.70.  Shop 3: 1059.40, 1107.10, 1151.40, and 1197.40.  Shop 4 in the first 3 months had numbers 1072.60, 1120.90, 1165.70, and 1212.30.  After 3 months, Shop 4 had 1082.40, 1131.10, 1176.40, and 1223.40."

nlp2 = spacy.load('en_core_web_lg', disable=['ner']) # default sentencizer
doc2 = nlp2(text2)
matches2 = matcher(doc2) # same matcher
for match_id, start, end in matches2:
    matched_span = doc2[start:end]
    print(matched_span.text)
    print(matched_span.sent, '\n')

# Shop 1
# Shop 1's numbers were 942.10, 984.50, 1023.90, and 1064.80.
#
# Shop 2
# Shop 2, for the first 12 months, had numbers 1032.70, 1079.10, 1122.30, and 1167.20.
#
# Shop 2
# Shop 2, after 12 months, had 1045.50, 1092.60, 1136.30, and 1181.70.
#
# Shop 3
# Shop 3: 1059.40, 1107.10, 1151.40, and 1197.40.
#
# Shop 4
# Shop 4 in the first 3 months had numbers 1072.60, 1120.90, 1165.70, and 1212.30.
#
# Shop 4
# After 3 months, Shop 4 had 1082.40, 1131.10, 1176.40, and 1223.40.

Note that this is not foolproof: the default parser will still get confused if the sentence structure becomes too complex or fancy. In general, NLP, and spaCy in particular, is not about parsing a small dataset to extract specific values exactly right every time; it is more about parsing gigabytes of documents quickly and doing well enough, in a statistical sense, to support meaningful computation over the data.
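As an aside (an editor's sketch, not part of the original answer): for data this regular, plain string handling with the standard library often sidesteps the NLP pipeline entirely. A minimal example using `re` on the original text, where the label regex is my own assumption about the row format:

```python
import re

text = ('Shop 1 942.10 984.50 1023.90 1064.80 \n\n'
        'Shop 2 first 12 months 1032.70 1079.10 1122.30 1167.20 \n\n'
        'Shop 2 after 12 months 1045.50 1092.60 1136.30 1181.70 \n\n'
        'Shop 3 1059.40 1107.10 1151.40 1197.40 \n\n'
        'Shop 4 first 3 months 1072.60 1120.90 1165.70 1212.30 \n\n'
        'Shop 4 after 3 months 1082.40 1131.10 1176.40 1223.40')

results = []
for row in text.split('\n\n'):
    # Label: 'Shop N', optionally followed by 'first/after N months'
    label = re.match(r'Shop \d+(?: (?:first|after) \d+ months)?', row).group()
    # Figures: every decimal number in the row
    figures = [float(n) for n in re.findall(r'\d+\.\d+', row)]
    results.append((label, figures))

print(results[0])
# ('Shop 1', [942.1, 984.5, 1023.9, 1064.8])
```

This is brittle in its own way (the regex must be updated if the row format changes), but for a fixed-format extract it is both faster and exact, which matches the answer's closing point.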

Regarding "python - How do I tell spaCy to segment sentences on full stops?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57660268/
