I want to use the spaCy Matcher to mine "is a" (and other) relations from Wikipedia in order to build a knowledge database.
I have the following code:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")
matcher = Matcher(nlp.vocab)
text = u"""Garfield is a large comic strip cat that lives in Ohio. Cape Town is the oldest city in South Africa."""
doc = nlp(text)
sentence_spans = list(doc.sents)
# Write a pattern: proper noun(s), a form of "be", a determiner, optional adjectives, noun(s)
pattern = [
    {"POS": "PROPN", "OP": "+"},
    {"LEMMA": "be"},
    {"POS": "DET"},
    {"POS": "ADJ", "OP": "*"},
    {"POS": "NOUN", "OP": "+"}
]
# Add the pattern to the matcher and apply the matcher to the doc
# (spaCy 3.x signature; spaCy 2.x used matcher.add("IS_A_PATTERN", None, pattern))
matcher.add("IS_A_PATTERN", [pattern])
matches = matcher(doc)
# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)
Unfortunately, this matches:
Match found: Garfield is a large comic strip
Match found: Garfield is a large comic strip cat
Match found: Town is the oldest city
Match found: Cape Town is the oldest city
whereas I only want:
Match found: Garfield is a large comic strip cat
Match found: Cape Town is the oldest city
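In addition, I would not mind being able to state that the first part of the match must be the subject of the sentence and the last part the predicate.
I would also like the result returned split up like this: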
['Garfield', 'is a', 'large comic strip cat', 'comic strip cat']
['Cape Town', 'is the', 'oldest city', 'city']
so that I can get a list of cities.
Is this possible in spaCy, or what would the equivalent Python code be?
Best answer
I think you need some syntactic analysis here. From a syntactic point of view, your sentences look like this:
          is
   _______|_______________
   |      |             cat
   |      |  ____________|_____________
   |      |  |    |     |     |     lives
   |      |  |    |     |     |    ___|___
   |      |  |    |     |     |    |    in
   |      |  |    |     |     |    |     |
Garfield  .  a  large comic strip that  Ohio

    is
____|_____________
|   |          city
|   |     _______|_______
|   |     |     |      in
|   |     |     |       |
|  Town   |     |     Africa
|   |     |     |       |
.  Cape  the  oldest  South
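(I used the method from this question to draw the trees.)
Now, instead of extracting substrings, you should extract subtrees. Minimal code to do this first finds the "is a" pattern and then yields the left and right subtrees, provided they are attached to the "is a" head with the right dependencies: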
def get_head(sentence):
    # Yield every form of "be" that is directly followed by a determiner ("is a", "is the", ...)
    toks = list(sentence)
    for i, t in enumerate(toks):
        if t.lemma_ == 'be' and i + 1 < len(toks) and toks[i+1].pos_ == 'DET':
            yield t

def get_relations(text):
    # nlp is the pipeline loaded in the question
    doc = nlp(text)
    for sent in doc.sents:
        for head in get_head(sent):
            children = list(head.children)
            if len(children) < 2:
                continue
            l, r = children[0:2]
            # check that the left child is really a subject and the right one is a description
            if l.dep_ == 'nsubj' and r.dep_ == 'attr':
                yield l, r

for l, r in get_relations(text):
    print(list(l.subtree), list(r.subtree))
It will output something like
[Garfield] [a, large, comic, strip, cat, that, lives, in, Ohio]
[Cape, Town] [the, oldest, city, in, South, Africa]
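So at least the left and right parts are separated correctly. You can add more filters if needed (for example, l.pos_ == 'PROPN'). Another improvement would be to handle cases where "is" has more than two children (for example, adverbs).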
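One way to handle the "more than two children" case is to pick the children by dependency label instead of by position. The following is only a minimal sketch: find_subject_and_attr and fake_token are hypothetical names, not part of spaCy, and the stand-in tokens exist only so that the sketch runs without a loaded model. With real spaCy tokens, head would be one of the tokens yielded by get_head.

```python
from types import SimpleNamespace


def find_subject_and_attr(head):
    """Return the (nsubj, attr) children of head, or None if either is missing.

    Hypothetical helper: it relies only on .children and .dep_, so extra
    children such as adverbs or punctuation are simply skipped.
    """
    subject = next((c for c in head.children if c.dep_ == "nsubj"), None)
    attr = next((c for c in head.children if c.dep_ == "attr"), None)
    if subject is None or attr is None:
        return None
    return subject, attr


def fake_token(text, dep, children=()):
    # Stand-in for a spaCy token (assumption: only these attributes are needed)
    return SimpleNamespace(text=text, dep_=dep, children=list(children))


head = fake_token("is", "ROOT", [
    fake_token("Garfield", "nsubj"),
    fake_token("probably", "advmod"),  # the extra child that breaks children[0:2]
    fake_token("cat", "attr"),
    fake_token(".", "punct"),
])

subject, attr = find_subject_and_attr(head)
print(subject.text, attr.text)
```

Inside get_relations, this would stand in for the l, r = children[0:2] line and the dependency check that follows it.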
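Now you can prune the subtrees as you like, producing smaller predicates (for example "large cat", "comic cat", "strip cat", "cat that lives in Ohio", and so on). A quick-and-dirty version of such pruning can look at one child at a time: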
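for l, r in get_relations(text):
    print(list(l.subtree), list(r.subtree))
    for c in r.children:
        # combine the head of the description with the subtree of a single child
        words = [r] + list(c.subtree)
        # sort by token index to restore the original word order
        print(' '.join([w.text for w in sorted(words, key=lambda x: x.i)]))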
It will produce results like the following
[Garfield] [a, large, comic, strip, cat, that, lives, in, Ohio]
a cat
large cat
comic cat
strip cat
cat that lives in Ohio
[Cape, Town] [the, oldest, city, in, South, Africa]
the city
oldest city
city in South Africa
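You can see that some of the subtrees are wrong: Cape Town is not the "oldest city" globally. But it seems you need at least some semantic knowledge to filter out such incorrect subtrees.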
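As for the duplicate matches in the question ("Town is the oldest city" alongside "Cape Town is the oldest city"), one option is to keep only the matches that are not contained in a longer one. The sketch below works on plain (match_id, start, end) tuples like those returned by the matcher. keep_longest is a hypothetical helper, and the token indices and the match id 1 are written by hand for illustration, assuming the tokenization of the example text. If you work with Span objects instead, spacy.util.filter_spans should offer similar behaviour.

```python
def keep_longest(matches):
    """Drop every match whose token range lies inside a longer kept match."""
    # Longest matches first; ties are broken by the earliest start
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    kept = []
    for match_id, start, end in ordered:
        contained = any(s <= start and end <= e for _, s, e in kept)
        if not contained:
            kept.append((match_id, start, end))
    # Return the survivors in document order
    return sorted(kept, key=lambda m: m[1])


# Hand-written (match_id, start, end) tuples standing in for matcher(doc)
matches = [
    (1, 0, 6),    # Garfield is a large comic strip
    (1, 0, 7),    # Garfield is a large comic strip cat
    (1, 13, 18),  # Town is the oldest city
    (1, 12, 18),  # Cape Town is the oldest city
]

print(keep_longest(matches))
```

With these indices only the two longer matches remain, which is the output the question asks for.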
Regarding python - Spacy "is a" mining, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56808822/