python - Spacy NLP - Chunking with regular expressions


spaCy includes a noun_chunks feature for retrieving the set of noun phrases. The function english_noun_chunks (attached below) uses word.pos == NOUN:

from spacy.symbols import NOUN

def english_noun_chunks(doc):
    # Dependency labels that can head a noun phrase.
    labels = ['nsubj', 'dobj', 'nsubjpass', 'pcomp', 'pobj',
              'attr', 'root']
    np_deps = [doc.vocab.strings[label] for label in labels]
    conj = doc.vocab.strings['conj']
    np_label = doc.vocab.strings['NP']
    for i in range(len(doc)):
        word = doc[i]
        if word.pos == NOUN and word.dep in np_deps:
            yield word.left_edge.i, word.i + 1, np_label
        elif word.pos == NOUN and word.dep == conj:
            head = word.head
            while head.dep == conj and head.head.i < head.i:
                head = head.head
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
                yield word.left_edge.i, word.i + 1, np_label
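
Note that the generator yields (start, end, label) index triples rather than Span objects. A minimal sketch of consuming it directly, assuming an English model is installed (spacy.load('en') matches the spaCy version of this question; newer releases use names like 'en_core_web_sm'):

import spacy

nlp = spacy.load('en')
doc = nlp(u'The quick brown fox jumps over the lazy dog.')

# Each item is (start_token_index, end_token_index, label_id);
# slicing the Doc recovers the actual phrase.
for start, end, label in english_noun_chunks(doc):
    print(doc[start:end].text)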

I want to get chunks from sentences that satisfy certain regular expressions. For example, a phrase consisting of zero or more adjectives followed by one or more nouns:

{(<JJ>)*(<NN | NNS | NNP>)+}

Is it possible to do this without overriding the english_noun_chunks function?
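
The pattern above follows NLTK-style chunk-grammar notation. For comparison, a minimal sketch of running the same kind of pattern through NLTK's RegexpParser (assuming nltk plus its punkt and averaged_perceptron_tagger data are installed; the whitespace inside the tag alternation is dropped here):

from nltk import RegexpParser, pos_tag, word_tokenize

# Chunk grammar: zero or more adjectives followed by one or more nouns.
grammar = 'NP: {<JJ>*<NN|NNS|NNP>+}'
parser = RegexpParser(grammar)

tagged = pos_tag(word_tokenize('Great work'))
# Prints a Tree with an NP chunk spanning (Great/JJ, work/NN).
print(parser.parse(tagged))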

Best Answer

You could rewrite that function without losing any performance, since it is implemented in pure Python, but why not just filter the chunks after obtaining them?

import re
import spacy

def filtered_chunks(doc, pattern):
    for chunk in doc.noun_chunks:
        # Build a tag signature such as '<JJ><NN>' for the chunk.
        signature = ''.join(['<%s>' % w.tag_ for w in chunk])
        if pattern.match(signature) is not None:
            yield chunk

nlp = spacy.load('en')
doc = nlp(u'Great work!')
pattern = re.compile(r'(<JJ>)*(<NN>|<NNS>|<NNP>)+')

print(list(filtered_chunks(doc, pattern)))
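
For the example input, 'Great' and 'work' are tagged JJ and NN, so the chunk's signature is '<JJ><NN>', the pattern matches, and the script should print [Great work]. Note that re.match only anchors at the start of the signature; if an exact match is wanted, pattern.fullmatch(signature) (Python 3.4+) or a trailing $ in the expression would be stricter.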

Regarding python - Spacy NLP - chunking with regular expressions, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/40716419/
