
python - Getting entities from an NLTK.tree result

Reposted. Author: 行者123. Updated: 2023-12-01 05:01:02

My original sentence is:

Tsunami earthquakes have also been linked to the presence of a thin layer of subducted sedimentary rock along the uppermost part of the plate interface, as is thought to be present in areas of significant topography at the top of the oceanic crust, and where propagation was in an up-dip direction, possibly reaching the seafloor.

I passed this sentence to Stanford NLP and got a nice parse tree:

(ROOT (S (NP (NN Tsunami) (NNS earthquakes)) (VP (VBP have) (ADVP (RB also)) (VP (VBN been) (VP (VBN linked) (PP (TO to) (NP (NP (DT the) (NN presence)) (PP (IN of) (NP (NP (DT a) (JJ thin) (NN layer)) (PP (IN of) (S (VP (VBN subducted) (NP (NP (JJ sedimentary) (NN rock)) (PP (IN along) (NP (NP (NP (DT the) (JJS uppermost) (NN part)) (PP (IN of) (NP (DT the) (NN plate) (NN interface)))) (, ,) (UCP (RB as) (S (VP (VBZ is) (VP (VBN thought) (S (VP (TO to) (VP (VB be) (ADJP (JJ present) (PP (IN in) (NP (NP (NNS areas)) (PP (IN of) (NP (JJ significant) (NN topography)))))) (PP (IN at) (NP (NP (DT the) (NN top)) (PP (IN of) (NP (DT the) (JJ oceanic) (NN crust))))))))))) (, ,) (CC and) (SBAR (WHADVP (WRB where)) (S (NP (NN propagation)) (VP (VBD was) (PP (IN in) (NP (DT an) (JJ up-dip) (NN direction))) (, ,) (ADVP (RB possibly))))))))) (S (VP (VBG reaching) (NP (DT the) (NN seafloor)))))))))))))) (. .)))

Then I fed the string above into nltk.Tree:

import nltk
parsed_tree = nltk.Tree.fromstring(parsetree_string)

The result looks good:

Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NN', ['Tsunami']), Tree('NNS', ['earthquakes'])]), Tree('VP', [Tree('VBP', ['have']), Tree('ADVP', [Tree('RB', ['also'])]), Tree('VP', [Tree('VBN', ['been']), Tree('VP', [Tree('VBN', ['linked']), Tree('PP', [Tree('TO', ['to']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['presence'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['thin']), Tree('NN', ['layer'])]), Tree('PP', [Tree('IN', ['of']), Tree('S', [Tree('VP', [Tree('VBN', ['subducted']), Tree('NP', [Tree('NP', [Tree('JJ', ['sedimentary']), Tree('NN', ['rock'])]), Tree('PP', [Tree('IN', ['along']), Tree('NP', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJS', ['uppermost']), Tree('NN', ['part'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['plate']), Tree('NN', ['interface'])])])]), Tree(',', [',']), Tree('UCP', [Tree('RB', ['as']), Tree('S', [Tree('VP', [Tree('VBZ', ['is']), Tree('VP', [Tree('VBN', ['thought']), Tree('S', [Tree('VP', [Tree('TO', ['to']), Tree('VP', [Tree('VB', ['be']), Tree('ADJP', [Tree('JJ', ['present']), Tree('PP', [Tree('IN', ['in']), Tree('NP', [Tree('NP', [Tree('NNS', ['areas'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('JJ', ['significant']), Tree('NN', ['topography'])])])])])]), Tree('PP', [Tree('IN', ['at']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['top'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['oceanic']), Tree('NN', ['crust'])])])])])])])])])])]), Tree(',', [',']), Tree('CC', ['and']), Tree('SBAR', [Tree('WHADVP', [Tree('WRB', ['where'])]), Tree('S', [Tree('NP', [Tree('NN', ['propagation'])]), Tree('VP', [Tree('VBD', ['was']), Tree('PP', [Tree('IN', ['in']), Tree('NP', [Tree('DT', ['an']), Tree('JJ', ['up-dip']), Tree('NN', ['direction'])])]), Tree(',', [',']), Tree('ADVP', [Tree('RB', ['possibly'])])])])])])])])]), Tree('S', [Tree('VP', [Tree('VBG', ['reaching']), 
Tree('NP', [Tree('DT', ['the']), Tree('NN', ['seafloor'])])])])])])])])])])])])])]), Tree('.', ['.'])])])
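For orientation, here is a minimal sketch of what Tree.fromstring gives back, using a much smaller bracketed string than the one above. The variable name `small` and the shortened parse are assumptions made for illustration; label(), leaves() and height() are standard nltk.Tree methods.

```python
from nltk import Tree

# A small bracketed parse in the same format as the Stanford output above
# (hypothetical fragment, cut down for readability).
small = Tree.fromstring(
    "(NP (NP (DT the) (NN top)) (PP (IN of) (NP (DT the) (JJ oceanic) (NN crust))))"
)

print(small.label())    # label of the root node
print(small.leaves())   # the words, left to right
print(small.height())   # a pre-terminal such as (DT the) has height 2

# Each child of a Tree is itself a Tree (or a plain string at the leaves).
for child in small:
    print(child.label(), "->", " ".join(child.leaves()))
```

The height grows by one for every level of nesting, which is one way to reason about the "levels" of a parse tree.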

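My question is: given parsed_tree, how can I get the leaf-level entities, such as "top of the oceanic crust" or "thin layer"?

I think the levels of the parse tree might be useful, but I got really lost looking at the tree levels and don't know what to do.

I mostly work in Python, and the Stanford NLP result was obtained with a Python wrapper (https://bitbucket.org/torotoki/corenlp-python).

Could someone help and point me in the right direction?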

Best answer

You could try extracting the subtrees labeled NP:

>>> from nltk import Tree
>>> parsed_tree = Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NN', ['Tsunami']), Tree('NNS', ['earthquakes'])]), Tree('VP', [Tree('VBP', ['have']), Tree('ADVP', [Tree('RB', ['also'])]), Tree('VP', [Tree('VBN', ['been']), Tree('VP', [Tree('VBN', ['linked']), Tree('PP', [Tree('TO', ['to']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['presence'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['thin']), Tree('NN', ['layer'])]), Tree('PP', [Tree('IN', ['of']), Tree('S', [Tree('VP', [Tree('VBN', ['subducted']), Tree('NP', [Tree('NP', [Tree('JJ', ['sedimentary']), Tree('NN', ['rock'])]), Tree('PP', [Tree('IN', ['along']), Tree('NP', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJS', ['uppermost']), Tree('NN', ['part'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['plate']), Tree('NN', ['interface'])])])]), Tree(',', [',']), Tree('UCP', [Tree('RB', ['as']), Tree('S', [Tree('VP', [Tree('VBZ', ['is']), Tree('VP', [Tree('VBN', ['thought']), Tree('S', [Tree('VP', [Tree('TO', ['to']), Tree('VP', [Tree('VB', ['be']), Tree('ADJP', [Tree('JJ', ['present']), Tree('PP', [Tree('IN', ['in']), Tree('NP', [Tree('NP', [Tree('NNS', ['areas'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('JJ', ['significant']), Tree('NN', ['topography'])])])])])]), Tree('PP', [Tree('IN', ['at']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['top'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['oceanic']), Tree('NN', ['crust'])])])])])])])])])])]), Tree(',', [',']), Tree('CC', ['and']), Tree('SBAR', [Tree('WHADVP', [Tree('WRB', ['where'])]), Tree('S', [Tree('NP', [Tree('NN', ['propagation'])]), Tree('VP', [Tree('VBD', ['was']), Tree('PP', [Tree('IN', ['in']), Tree('NP', [Tree('DT', ['an']), Tree('JJ', ['up-dip']), Tree('NN', ['direction'])])]), Tree(',', [',']), Tree('ADVP', [Tree('RB', ['possibly'])])])])])])])])]), Tree('S', [Tree('VP', [Tree('VBG', 
['reaching']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['seafloor'])])])])])])])])])])])])])]), Tree('.', ['.'])])])

>>> np = [" ".join(i.leaves()) for i in parsed_tree.subtrees() if i.label() == 'NP']
>>> np
['Tsunami earthquakes', 'the presence of a thin layer of subducted sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly reaching the seafloor', 'the presence', 'a thin layer of subducted sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly reaching the seafloor', 'a thin layer', 'sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly', 'sedimentary rock', 'the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly', 'the uppermost part of the plate interface', 'the uppermost part', 'the plate interface', 'areas of significant topography', 'areas', 'significant topography', 'the top of the oceanic crust', 'the top', 'the oceanic crust', 'propagation', 'an up-dip direction', 'the seafloor']
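The same extraction can also be written with the filter argument of subtrees(), and pairing each phrase with its height shows how deeply nested it is. This is only a sketch on a shortened, made-up parse; the names `tree`, `is_np` and `noun_phrases` are assumptions for illustration.

```python
from nltk import Tree

# Hypothetical short parse, not the full sentence from the question.
tree = Tree.fromstring(
    "(S (NP (NN Tsunami) (NNS earthquakes)) "
    "(VP (VBD reached) (NP (NP (DT the) (NN top)) "
    "(PP (IN of) (NP (DT the) (JJ oceanic) (NN crust))))))"
)

def is_np(t):
    """Keep only subtrees whose node label is NP."""
    return t.label() == "NP"

# subtrees() walks the whole tree top-down; filter keeps the NP nodes only.
noun_phrases = [(t.height(), " ".join(t.leaves())) for t in tree.subtrees(filter=is_np)]
for height, phrase in noun_phrases:
    print(height, phrase)
```

A low height marks a flat phrase such as "the top", while a larger height marks an NP that wraps other phrases.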

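But this produces a lot of noise, so let's assume that a single word is not a phrase:

>>> np_mwe = [j for j in [" ".join(i.leaves()) for i in parsed_tree.subtrees() if i.label() == 'NP'] if j.count(' ') > 0]
>>> np_mwe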
['Tsunami earthquakes', 'the presence of a thin layer of subducted sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly reaching the seafloor', 'the presence', 'a thin layer of subducted sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly reaching the seafloor', 'a thin layer', 'sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly', 'sedimentary rock', 'the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly', 'the uppermost part of the plate interface', 'the uppermost part', 'the plate interface', 'areas of significant topography', 'significant topography', 'the top of the oceanic crust', 'the top', 'the oceanic crust', 'an up-dip direction', 'the seafloor']

Still noisy, so let's assume that a noun phrase should not contain a comma (not necessarily true, but a useful trick):

>>> np_mwe_nocomma = [j for j in [" ".join(i.leaves()) for i in parsed_tree.subtrees() if i.label() == 'NP'] if j.count(' ') > 0 and j.count(',') == 0]
>>> np_mwe_nocomma
['Tsunami earthquakes', 'the presence', 'a thin layer', 'sedimentary rock', 'the uppermost part of the plate interface', 'the uppermost part', 'the plate interface', 'areas of significant topography', 'significant topography', 'the top of the oceanic crust', 'the top', 'the oceanic crust', 'an up-dip direction', 'the seafloor']

Now it is easy to see phrases nested inside other phrases, so let's keep only the larger ones:

>>> x = []
>>> for i in sorted(np_mwe_nocomma, key=len, reverse=True):
...     # Skip a phrase that is already contained in a longer phrase we kept.
...     if any(i in j for j in x):
...         continue
...     print(i)
...     x.append(i)
...
the uppermost part of the plate interface
areas of significant topography
the top of the oceanic crust
Tsunami earthquakes
an up-dip direction
sedimentary rock
the presence
a thin layer
the seafloor
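One caveat with comparing phrases using `in` on strings is that it matches characters, so "the top" would count as part of "the topography". A possible variant, sketched below, compares word sequences instead. The helper names `contains_tokens` and `keep_longest` are hypothetical and not part of NLTK.

```python
def contains_tokens(longer, shorter):
    """True if the word list `shorter` occurs contiguously inside `longer`."""
    n = len(shorter)
    return any(longer[k:k + n] == shorter for k in range(len(longer) - n + 1))

def keep_longest(phrases):
    """Drop every phrase whose words appear inside a longer phrase already kept."""
    kept = []
    # Visit phrases from most words to fewest so containers are seen first.
    for phrase in sorted(phrases, key=lambda p: len(p.split()), reverse=True):
        words = phrase.split()
        if not any(contains_tokens(k.split(), words) for k in kept):
            kept.append(phrase)
    return kept

phrases = ["the top of the oceanic crust", "the top", "the oceanic crust",
           "significant topography", "areas of significant topography"]
print(keep_longest(phrases))
```

Sorting by word count rather than character count keeps the comparison consistent with the word-level containment test.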

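I'm not sure whether this gives you what you need, but your definition of "entity" needs to be more specific; otherwise almost every NP the parser tags could count as an "entity".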
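As one example of a more specific definition, an "entity" could be taken to mean an innermost NP, that is, an NP with no other NP nested inside it. The sketch below assumes that definition; the function name `innermost_nps` and the short parse are made up for illustration.

```python
from nltk import Tree

def innermost_nps(tree):
    """Return the text of NP subtrees that contain no other NP beneath them."""
    result = []
    for t in tree.subtrees(lambda s: s.label() == "NP"):
        # t.subtrees() yields t itself too, so a count of 1 means no nested NP.
        np_count = sum(1 for s in t.subtrees() if s.label() == "NP")
        if np_count == 1:
            result.append(" ".join(t.leaves()))
    return result

# Hypothetical fragment of the parse from the question.
tree = Tree.fromstring(
    "(NP (NP (DT the) (NN top)) (PP (IN of) (NP (DT the) (JJ oceanic) (NN crust))))"
)
print(innermost_nps(tree))
```

This picks the smallest noun phrases rather than the largest, so it is the opposite choice to the filtering above; which one fits depends on what should count as an entity.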

Regarding "python - Getting entities from an NLTK.tree result", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/26210567/
