
python - How do I tell spaCy to segment sentences on full stops?

Reposted. Author: 行者123. Updated: 2023-12-01 07:21:56

I have the following text:

text = 'Shop 1 942.10 984.50 1023.90 1064.80 \n\nShop 2 first 12 months 1032.70 1079.10 1122.30 1167.20 \n\nShop 2 after 12 months 1045.50 1092.60 1136.30 1181.70 \n\nShop 3 1059.40 1107.10 1151.40 1197.40 \n\nShop 4 first 3 months 1072.60 1120.90 1165.70 1212.30 \n\nShop 4 after 3 months 1082.40 1131.10 1176.40 1223.40'

I clean it up by replacing \n\n with '. ' using this code:

text = text.replace('\n\n', '. ')
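As a quick sanity check (plain Python, no spaCy needed, and not part of the original question), after the substitution the string splits cleanly into one entry per shop row:

```python
text = ('Shop 1 942.10 984.50 1023.90 1064.80 \n\n'
        'Shop 2 first 12 months 1032.70 1079.10 1122.30 1167.20 \n\n'
        'Shop 2 after 12 months 1045.50 1092.60 1136.30 1181.70 \n\n'
        'Shop 3 1059.40 1107.10 1151.40 1197.40 \n\n'
        'Shop 4 first 3 months 1072.60 1120.90 1165.70 1212.30 \n\n'
        'Shop 4 after 3 months 1082.40 1131.10 1176.40 1223.40')

# Replace the double newlines with '. ' as in the question, then split
# on '. ' -- the '.' inside numbers like 942.10 is never followed by a
# space, so only row boundaries match.
cleaned = text.replace('\n\n', '. ')
rows = cleaned.split('. ')
print(len(rows))        # 6
print(rows[0].strip())  # Shop 1 942.10 984.50 1023.90 1064.80
```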

I built a matcher with a simple generic pattern like this:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_lg', disable=['ner'])
doc = nlp(text)
matcher = Matcher(nlp.vocab)
pattern = [{'ORTH': 'Shop'}, {'LIKE_NUM': True}]
matcher.add('REV', None, pattern)

Then I use the matcher to find all the sentences in the text, which I expect to be separated by full stops:

matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)
    print(matched_span.sent, '\n')

I expect to get these results:

Shop 1
Shop 1 942.10 984.50 1023.90 1064.80 .

Shop 2
Shop 2 first 12 months 1032.70 1079.10 1122.30 1167.20 .

Shop 2
Shop 2 after 12 months 1045.50 1092.60 1136.30 1181.70 .

Shop 3
Shop 3 1059.40 1107.10 1151.40 1197.40 .

Shop 4
Shop 4 first 3 months 1072.60 1120.90 1165.70 1212.30 .

Shop 4
Shop 4 after 3 months 1082.40 1131.10 1176.40 1223.40

However, because of the way spaCy processes text, it does not split sentences on the full stop . but by some opaque rules I cannot identify, so my code returns the following results instead:

Shop 1
Shop 1 942.10

Shop 2
Shop 2 first 12 months

Shop 2
Shop 2 after 12 months 1045.50 1092.60

Shop 3
Shop 3

Shop 4
Shop 4 first 3 months

Shop 4
Shop 4 after 3 months

Is there a way to instruct/override how spaCy identifies sentences in the text based on a specific pattern (in this case, the full stop .)?

Best Answer

What you probably want to do is define a custom sentence segmenter. The default sentence segmentation algorithm spaCy uses relies on the dependency tree to figure out where sentences begin and end. You can override it by creating your own function that defines the sentence boundaries and adding it to the NLP pipeline. Following the example in spaCy's documentation:

import spacy

def custom_sentencizer(doc):
    ''' Look for sentence start tokens by scanning for periods only. '''
    for i, token in enumerate(doc[:-2]):  # The last token cannot start a sentence
        if token.text == ".":
            doc[i + 1].is_sent_start = True
        else:
            doc[i + 1].is_sent_start = False  # Tell the default sentencizer to ignore this token
    return doc

nlp = spacy.load('en_core_web_lg', disable=['ner'])
nlp.add_pipe(custom_sentencizer, before="parser") # Insert before the parser can build its own sentences
# text = ...
doc = nlp(text)

matcher = spacy.matcher.Matcher(nlp.vocab)
pattern = [{'ORTH': 'Shop'}, {'LIKE_NUM': True}]
matcher.add('REV', None, pattern)

matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)
    print(matched_span.sent, '\n')

# Shop 1
# Shop 1 942.10 984.50 1023.90 1064.80 .
#
# Shop 2
# Shop 2 first 12 months 1032.70 1079.10 1122.30 1167.20 .
#
# Shop 2
# Shop 2 after 12 months 1045.50 1092.60 1136.30 1181.70 .
#
# Shop 3
# Shop 3 1059.40 1107.10 1151.40 1197.40 .
#
# Shop 4
# Shop 4 first 3 months 1072.60 1120.90 1165.70 1212.30 .
#
# Shop 4
# Shop 4 after 3 months 1082.40 1131.10 1176.40 1223.40

Your text is quite different from natural language, so it is not surprising that spaCy performs poorly. Its internal models were trained on examples that unambiguously look like text you would read in a book or on the internet, whereas your example looks more like a machine-readable list of numbers. For instance, if the text you were working with were written more like prose, it might look like this:

Shop 1's numbers were 942.10, 984.50, 1023.90, and 1064.80. Shop 2, for the first 12 months, had numbers 1032.70, 1079.10, 1122.30, and 1167.20. Shop 2, after 12 months, had 1045.50, 1092.60, 1136.30, and 1181.70. Shop 3: 1059.40, 1107.10, 1151.40, and 1197.40. Shop 4 in the first 3 months had numbers 1072.60, 1120.90, 1165.70, and 1212.30. After 3 months, Shop 4 had 1082.40, 1131.10, 1176.40, and 1223.40.

Using that as input gives spaCy's default parser a much better chance of figuring out where the sentence breaks are, even with all the extra punctuation:

text2 = "Shop 1's numbers were 942.10, 984.50, 1023.90, and 1064.80.  Shop 2, for the first 12 months, had numbers 1032.70, 1079.10, 1122.30, and 1167.20.  Shop 2, after 12 months, had 1045.50, 1092.60, 1136.30, and 1181.70.  Shop 3: 1059.40, 1107.10, 1151.40, and 1197.40.  Shop 4 in the first 3 months had numbers 1072.60, 1120.90, 1165.70, and 1212.30.  After 3 months, Shop 4 had 1082.40, 1131.10, 1176.40, and 1223.40."

nlp2 = spacy.load('en_core_web_lg', disable=['ner']) # default sentencizer
doc2 = nlp2(text2)
matches2 = matcher(doc2) # same matcher
for match_id, start, end in matches2:
    matched_span = doc2[start:end]
    print(matched_span.text)
    print(matched_span.sent, '\n')

# Shop 1
# Shop 1's numbers were 942.10, 984.50, 1023.90, and 1064.80.
#
# Shop 2
# Shop 2, for the first 12 months, had numbers 1032.70, 1079.10, 1122.30, and 1167.20.
#
# Shop 2
# Shop 2, after 12 months, had 1045.50, 1092.60, 1136.30, and 1181.70.
#
# Shop 3
# Shop 3: 1059.40, 1107.10, 1151.40, and 1197.40.
#
# Shop 4
# Shop 4 in the first 3 months had numbers 1072.60, 1120.90, 1165.70, and 1212.30.
#
# Shop 4
# After 3 months, Shop 4 had 1082.40, 1131.10, 1176.40, and 1223.40.

Note that this is not foolproof: the default parser will still get confused if the sentence structure becomes too complex or fancy. In general, NLP, and spaCy in particular, is not about parsing a small dataset to extract specific values exactly right every time; it is more about parsing gigabytes of documents quickly and doing well enough, in a statistical sense, to support meaningful computation over the data.
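As an aside (an editor's sketch, not part of the original answer): for data this regular, plain string handling with the standard library often sidesteps the NLP pipeline entirely. A minimal example using `re` on the original text, where the label regex is my own assumption about the row format:

```python
import re

text = ('Shop 1 942.10 984.50 1023.90 1064.80 \n\n'
        'Shop 2 first 12 months 1032.70 1079.10 1122.30 1167.20 \n\n'
        'Shop 2 after 12 months 1045.50 1092.60 1136.30 1181.70 \n\n'
        'Shop 3 1059.40 1107.10 1151.40 1197.40 \n\n'
        'Shop 4 first 3 months 1072.60 1120.90 1165.70 1212.30 \n\n'
        'Shop 4 after 3 months 1082.40 1131.10 1176.40 1223.40')

results = []
for row in text.split('\n\n'):
    # Label: 'Shop N', optionally followed by 'first/after N months'
    label = re.match(r'Shop \d+(?: (?:first|after) \d+ months)?', row).group()
    # Figures: every decimal number in the row
    figures = [float(n) for n in re.findall(r'\d+\.\d+', row)]
    results.append((label, figures))

print(results[0])
# ('Shop 1', [942.1, 984.5, 1023.9, 1064.8])
```

This is brittle in its own way (the regex must be updated if the row format changes), but for a fixed-format extract it is both faster and exact, which matches the answer's closing point.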

Regarding "python - How do I tell spaCy to segment sentences on full stops?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57660268/
