
python - spaCy's regular expressions behave differently from Python's regular expressions

Reposted. Author: 太空宇宙. Updated: 2023-11-04 11:13:03

I have the following text:

text = 'Monday to Friday 12 midnight to 5am 30% . Midnight Friday to 6am Saturday 30% . 9pm Saturday to Midnight Saturday 25% . Midnight Saturday to 6am Sunday 100% . 6am Sunday to 9pm Sunday 50%'

When I use a plain regular expression, I get the following:

import re
regex = r'\d{1}[a|p]m'  # raw string so that \d is not treated as a string escape
re.findall(regex, text)

# Returned:
['5am', '6am', '9pm', '6am', '6am', '9pm']

However, when I use the same regex in spaCy, I get nothing:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_lg')

matcher = Matcher(nlp.vocab)
pattern = [{'TEXT': {'REGEX': r'\d{1}[a|p]m'}}]
# spaCy v3 signature; under spaCy v2 this was matcher.add('TIME', None, pattern)
matcher.add('TIME', [pattern])

doc = nlp(text)
matches = matcher(doc)

# Print the sentence that contains each match
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.sent.text)

Does this mean we can't use ordinary regular expressions in spaCy? If so, do you know where I can learn spaCy's special regex syntax? Thanks.

Best answer

Keep in mind that the tokenizer splits the number from the letters here. See this test:

doc = nlp("1pm")
print([token.text for token in doc]) # => ['1', 'pm']

According to the spaCy docs:

If spaCy’s tokenization doesn’t match the tokens defined in a pattern, the pattern is not going to produce any results.
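To make the quoted point concrete, here is a minimal sketch of per-token matching that uses only Python's `re` module. The token list is hard-coded as an assumption, based on the tokenizer test above, so it runs without spaCy installed. The `str.isdigit()` check is only a rough stand-in for spaCy's `LIKE_NUM`, which accepts more forms than plain digits.

```python
import re

# Assumed token texts, following the split shown in the tokenizer test
# ("1pm" -> ['1', 'pm']); hard-coded so the sketch needs no spaCy install.
tokens = ["to", "5", "am", "30", "%"]

whole_time = re.compile(r"\d[ap]m")  # digit and am/pm must sit in ONE string
am_pm_only = re.compile(r"^[ap]m$")  # fits a single token such as "am"

# A REGEX entry in a Matcher pattern is tested against one token at a time,
# so a regex spanning two tokens never finds anything.
per_token_hits = [t for t in tokens if whole_time.search(t)]
print(per_token_hits)

# Splitting the check across two adjacent tokens does find the time.
two_token_hits = [
    tokens[i] + tokens[i + 1]
    for i in range(len(tokens) - 1)
    if tokens[i].isdigit() and am_pm_only.search(tokens[i + 1].lower())
]
print(two_token_hits)
```

The first list comes out empty and the second contains `5am`, which mirrors why the original single-token pattern returns nothing while a two-token pattern works.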

You need to define your own entity with rule-based matching, using one pattern entry per token:

pattern = [{'LIKE_NUM': True}, {'LOWER': {'REGEX' : '^[ap]m$'}}]
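Two details of this pattern differ from the regex in the question: the character class is `[ap]` rather than `[a|p]`, and the expression is anchored with `^` and `$`. The following sketch, using plain `re` with made-up sample strings, shows what each change does. Inside square brackets `|` is a literal character, and without anchors a search can succeed in the middle of a longer token.

```python
import re

sample = "5am 6|m 9pm"  # "6|m" is a made-up string to expose the literal '|'

# Inside [...] the '|' is an ordinary character, not alternation.
with_pipe = re.findall(r"\d[a|p]m", sample)
without_pipe = re.findall(r"\d[ap]m", sample)
print(with_pipe)
print(without_pipe)

# ^ and $ require the WHOLE token to be "am" or "pm".
candidates = ["am", "pm", "amp", "spam"]
anchored = [t for t in candidates if re.search(r"^[ap]m$", t)]
unanchored = [t for t in candidates if re.search(r"[ap]m", t)]
print(anchored)
print(unanchored)
```

Matching on `LOWER` rather than `TEXT` additionally lets the same pattern accept tokens such as `AM` or `Pm`, since the regex is applied to the lowercased form.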

Then add it to the matcher:

# spaCy v3 signature; under spaCy v2 this was matcher.add('TIME', None, pattern)
matcher.add('TIME', [pattern])

And get the matches:

matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]  # The matched span
    print(span.text)

Full demo:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

text = 'Monday to Friday 12 midnight to 5am 30% . Midnight Friday to 6am Saturday 30% . 9pm Saturday to Midnight Saturday 25% . Midnight Saturday to 6am Sunday 100% . 6am Sunday to 9pm Sunday 50%'
doc = nlp(text)

matcher = Matcher(nlp.vocab)
pattern = [{'LIKE_NUM': True}, {'LOWER': {'REGEX' : '^[ap]m$'}}]
# spaCy v3 signature; under spaCy v2 this was matcher.add('TIME', None, pattern)
matcher.add('TIME', [pattern])

matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])
#=> [5am, 6am, 9pm, 6am, 6am, 9pm]
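As for whether ordinary regular expressions can be used with spaCy at all: a different technique from the Matcher, described in the spaCy usage docs, is to run `re.finditer` over `doc.text` and convert the character offsets into spans with `Doc.char_span`. The sketch below is not part of the original answer. It assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) to avoid a model download, and it uses a shortened version of the text.

```python
import re
import spacy

# Assumption: a blank English pipeline is enough here because only the
# tokenizer is needed; a loaded model such as en_core_web_sm also works.
nlp = spacy.blank("en")

text = "Monday to Friday 12 midnight to 5am 30% . 9pm Saturday to Midnight Saturday 25%"
doc = nlp(text)

time_re = re.compile(r"\d[ap]m")

spans = []
for m in time_re.finditer(doc.text):
    start, end = m.span()
    # char_span returns None when the offsets do not line up with token boundaries
    span = doc.char_span(start, end)
    if span is not None:
        spans.append(span)

times = [span.text for span in spans]
print(times)
```

This keeps the full `re` syntax across token boundaries, at the cost of silently dropping any match whose offsets fall inside a token, so the `None` check matters.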

Regarding "python - spaCy's regular expressions behave differently from Python's regular expressions", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57727543/
