My original sentence is:
Tsunami earthquakes have also been linked to the presence of a thin layer of subducted sedimentary rock along the uppermost part of the plate interface, as is thought to be present in areas of significant topography at the top of the oceanic crust, and where propagation was in an up-dip direction, possibly reaching the seafloor.
I passed this sentence to Stanford NLP and got a good parse tree:
(ROOT (S (NP (NN Tsunami) (NNS earthquakes)) (VP (VBP have) (ADVP (RB also)) (VP (VBN been) (VP (VBN linked) (PP (TO to) (NP (NP (DT the) (NN presence)) (PP (IN of) (NP (NP (DT a) (JJ thin) (NN layer)) (PP (IN of) (S (VP (VBN subducted) (NP (NP (JJ sedimentary) (NN rock)) (PP (IN along) (NP (NP (NP (DT the) (JJS uppermost) (NN part)) (PP (IN of) (NP (DT the) (NN plate) (NN interface)))) (, ,) (UCP (RB as) (S (VP (VBZ is) (VP (VBN thought) (S (VP (TO to) (VP (VB be) (ADJP (JJ present) (PP (IN in) (NP (NP (NNS areas)) (PP (IN of) (NP (JJ significant) (NN topography)))))) (PP (IN at) (NP (NP (DT the) (NN top)) (PP (IN of) (NP (DT the) (JJ oceanic) (NN crust))))))))))) (, ,) (CC and) (SBAR (WHADVP (WRB where)) (S (NP (NN propagation)) (VP (VBD was) (PP (IN in) (NP (DT an) (JJ up-dip) (NN direction))) (, ,) (ADVP (RB possibly))))))))) (S (VP (VBG reaching) (NP (DT the) (NN seafloor)))))))))))))) (. .)))
Then I fed the string above into nltk.Tree:
parsed_tree = nltk.Tree.fromstring(parsetree_string)
The result looks good:
Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NN', ['Tsunami']), Tree('NNS', ['earthquakes'])]), Tree('VP', [Tree('VBP', ['have']), Tree('ADVP', [Tree('RB', ['also'])]), Tree('VP', [Tree('VBN', ['been']), Tree('VP', [Tree('VBN', ['linked']), Tree('PP', [Tree('TO', ['to']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['presence'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['thin']), Tree('NN', ['layer'])]), Tree('PP', [Tree('IN', ['of']), Tree('S', [Tree('VP', [Tree('VBN', ['subducted']), Tree('NP', [Tree('NP', [Tree('JJ', ['sedimentary']), Tree('NN', ['rock'])]), Tree('PP', [Tree('IN', ['along']), Tree('NP', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJS', ['uppermost']), Tree('NN', ['part'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['plate']), Tree('NN', ['interface'])])])]), Tree(',', [',']), Tree('UCP', [Tree('RB', ['as']), Tree('S', [Tree('VP', [Tree('VBZ', ['is']), Tree('VP', [Tree('VBN', ['thought']), Tree('S', [Tree('VP', [Tree('TO', ['to']), Tree('VP', [Tree('VB', ['be']), Tree('ADJP', [Tree('JJ', ['present']), Tree('PP', [Tree('IN', ['in']), Tree('NP', [Tree('NP', [Tree('NNS', ['areas'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('JJ', ['significant']), Tree('NN', ['topography'])])])])])]), Tree('PP', [Tree('IN', ['at']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['top'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['oceanic']), Tree('NN', ['crust'])])])])])])])])])])]), Tree(',', [',']), Tree('CC', ['and']), Tree('SBAR', [Tree('WHADVP', [Tree('WRB', ['where'])]), Tree('S', [Tree('NP', [Tree('NN', ['propagation'])]), Tree('VP', [Tree('VBD', ['was']), Tree('PP', [Tree('IN', ['in']), Tree('NP', [Tree('DT', ['an']), Tree('JJ', ['up-dip']), Tree('NN', ['direction'])])]), Tree(',', [',']), Tree('ADVP', [Tree('RB', ['possibly'])])])])])])])])]), Tree('S', [Tree('VP', [Tree('VBG', ['reaching']), 
Tree('NP', [Tree('DT', ['the']), Tree('NN', ['seafloor'])])])])])])])])])])])])])]), Tree('.', ['.'])])])
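My question is: given parsed_tree, how can I get the leaf-level entities, such as "the top of the oceanic crust" and "a thin layer"?
I think the levels of the parse tree might be useful, but I got lost when looking at the tree levels and don't know how to proceed.
I mainly work in Python, and the Stanford NLP results were obtained with a Python wrapper (https://bitbucket.org/torotoki/corenlp-python).
Can anyone help and point me in the right direction?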
Best answer
You can try extracting the subtrees labeled NP:
>>> from nltk import Tree
>>> parsed_tree = Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NN', ['Tsunami']), Tree('NNS', ['earthquakes'])]), Tree('VP', [Tree('VBP', ['have']), Tree('ADVP', [Tree('RB', ['also'])]), Tree('VP', [Tree('VBN', ['been']), Tree('VP', [Tree('VBN', ['linked']), Tree('PP', [Tree('TO', ['to']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['presence'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['thin']), Tree('NN', ['layer'])]), Tree('PP', [Tree('IN', ['of']), Tree('S', [Tree('VP', [Tree('VBN', ['subducted']), Tree('NP', [Tree('NP', [Tree('JJ', ['sedimentary']), Tree('NN', ['rock'])]), Tree('PP', [Tree('IN', ['along']), Tree('NP', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJS', ['uppermost']), Tree('NN', ['part'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['plate']), Tree('NN', ['interface'])])])]), Tree(',', [',']), Tree('UCP', [Tree('RB', ['as']), Tree('S', [Tree('VP', [Tree('VBZ', ['is']), Tree('VP', [Tree('VBN', ['thought']), Tree('S', [Tree('VP', [Tree('TO', ['to']), Tree('VP', [Tree('VB', ['be']), Tree('ADJP', [Tree('JJ', ['present']), Tree('PP', [Tree('IN', ['in']), Tree('NP', [Tree('NP', [Tree('NNS', ['areas'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('JJ', ['significant']), Tree('NN', ['topography'])])])])])]), Tree('PP', [Tree('IN', ['at']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['top'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['oceanic']), Tree('NN', ['crust'])])])])])])])])])])]), Tree(',', [',']), Tree('CC', ['and']), Tree('SBAR', [Tree('WHADVP', [Tree('WRB', ['where'])]), Tree('S', [Tree('NP', [Tree('NN', ['propagation'])]), Tree('VP', [Tree('VBD', ['was']), Tree('PP', [Tree('IN', ['in']), Tree('NP', [Tree('DT', ['an']), Tree('JJ', ['up-dip']), Tree('NN', ['direction'])])]), Tree(',', [',']), Tree('ADVP', [Tree('RB', ['possibly'])])])])])])])])]), Tree('S', [Tree('VP', [Tree('VBG', 
['reaching']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['seafloor'])])])])])])])])])])])])])]), Tree('.', ['.'])])])
>>> np = [" ".join(i.leaves()) for i in parsed_tree.subtrees() if i.label() == 'NP']
>>> np
['Tsunami earthquakes', 'the presence of a thin layer of subducted sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly reaching the seafloor', 'the presence', 'a thin layer of subducted sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly reaching the seafloor', 'a thin layer', 'sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly', 'sedimentary rock', 'the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly', 'the uppermost part of the plate interface', 'the uppermost part', 'the plate interface', 'areas of significant topography', 'areas', 'significant topography', 'the top of the oceanic crust', 'the top', 'the oceanic crust', 'propagation', 'an up-dip direction', 'the seafloor']
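But this produces a lot of noise, so let's assume that a single word is not a phrase:
>>> np_mwe = [j for j in np if j.count(' ') > 0]
>>> np_mwe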
['Tsunami earthquakes', 'the presence of a thin layer of subducted sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly reaching the seafloor', 'the presence', 'a thin layer of subducted sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly reaching the seafloor', 'a thin layer', 'sedimentary rock along the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly', 'sedimentary rock', 'the uppermost part of the plate interface , as is thought to be present in areas of significant topography at the top of the oceanic crust , and where propagation was in an up-dip direction , possibly', 'the uppermost part of the plate interface', 'the uppermost part', 'the plate interface', 'areas of significant topography', 'significant topography', 'the top of the oceanic crust', 'the top', 'the oceanic crust', 'an up-dip direction', 'the seafloor']
Still noisy. Let's assume that a noun phrase should not contain a comma (not necessarily true, but a useful hack):
>>> np_mwe_nocomma = [j for j in [" ".join(i.leaves()) for i in parsed_tree.subtrees() if i.label() == 'NP'] if j.count(' ') > 0 and j.count(',') == 0]
>>> np_mwe_nocomma
['Tsunami earthquakes', 'the presence', 'a thin layer', 'sedimentary rock', 'the uppermost part of the plate interface', 'the uppermost part', 'the plate interface', 'areas of significant topography', 'significant topography', 'the top of the oceanic crust', 'the top', 'the oceanic crust', 'an up-dip direction', 'the seafloor']
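The same two filters can also be applied to the tokens of each subtree instead of counting characters in the joined string. Below is a minimal sketch, assuming NLTK 3; the function name `multiword_comma_free_nps` and the small `toy` tree are made up for illustration and are not part of the parse above.

```python
from nltk import Tree

# Small hypothetical parse used only for illustration.
toy = Tree.fromstring(
    "(NP (NP (DT the) (NN top)) (, ,) (NP (NNS areas)) "
    "(PP (IN of) (NP (DT the) (JJ oceanic) (NN crust))))"
)

def multiword_comma_free_nps(tree):
    """Return NP phrases that have more than one token and no comma token."""
    phrases = []
    for sub in tree.subtrees(filter=lambda t: t.label() == "NP"):
        tokens = sub.leaves()
        if len(tokens) > 1 and "," not in tokens:
            phrases.append(" ".join(tokens))
    return phrases

print(multiword_comma_free_nps(toy))
```

Checking tokens rather than characters means a token that merely contains a comma (a number such as "1,000", for example) would not disqualify a phrase.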
Now it is easy to see that some phrases are contained in others, so let's keep only the larger ones:
>>> x = []
>>> for i in sorted(np_mwe_nocomma, key=len, reverse=True):
...     # skip i if it is contained in a longer phrase that was already kept
...     if any(i in j for j in x):
...         continue
...     print(i)
...     x.append(i)
...
the uppermost part of the plate interface
areas of significant topography
the top of the oceanic crust
Tsunami earthquakes
an up-dip direction
sedimentary rock
the presence
a thin layer
the seafloor
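Substring matching on the joined text can drop a phrase just because the same words occur inside some unrelated longer phrase. One alternative is to walk the tree from the top and stop at the first NP that passes the filter. This is only a sketch, assuming NLTK 3; `maximal_nps`, `multiword_no_comma` and the `toy` tree are hypothetical names and data chosen for illustration.

```python
from nltk import Tree

def maximal_nps(tree, accept):
    """Collect the outermost NP subtrees below `tree` that satisfy accept().

    Accepted NPs are not searched further; rejected nodes are descended into.
    The root node itself is never tested, only its descendants.
    """
    found = []
    for child in tree:
        if not isinstance(child, Tree):
            continue  # a leaf token (plain string)
        if child.label() == "NP" and accept(child):
            found.append(" ".join(child.leaves()))
        else:
            found.extend(maximal_nps(child, accept))
    return found

def multiword_no_comma(np_tree):
    """Same heuristic as above: more than one token and no comma token."""
    tokens = np_tree.leaves()
    return len(tokens) > 1 and "," not in tokens

# Small hypothetical parse used only for illustration.
toy = Tree.fromstring(
    "(ROOT (S (NP (NN Tsunami) (NNS earthquakes)) (VP (VBP reach) "
    "(NP (NP (DT the) (NN top)) (PP (IN of) (NP (DT the) (JJ oceanic) (NN crust)))))))"
)

print(maximal_nps(toy, multiword_no_comma))
```

Because the decision is made on tree positions, two different NPs with identical wording would both be kept.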
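I'm not sure whether this gives you what you need, but your definition of "entity" needs to be more specific; otherwise almost every NP tagged by the parser could be an "entity".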
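As one example of a more specific definition, you could treat only "base" noun phrases as entities, that is, NPs that contain no other NP. This is a sketch of that one assumption, not a recommendation; `base_nps` and the `toy` tree are hypothetical.

```python
from nltk import Tree

def base_nps(tree):
    """Return NP phrases that contain no other NP below them."""
    result = []
    for sub in tree.subtrees(filter=lambda t: t.label() == "NP"):
        # subtrees() also yields `sub` itself, so a base NP counts exactly one NP.
        np_count = sum(1 for t in sub.subtrees() if t.label() == "NP")
        if np_count == 1:
            result.append(" ".join(sub.leaves()))
    return result

# Small hypothetical parse used only for illustration.
toy = Tree.fromstring(
    "(NP (NP (DT the) (NN top)) (PP (IN of) (NP (DT the) (JJ oceanic) (NN crust))))"
)

print(base_nps(toy))
```

Note the trade-off: this definition returns "the top" and "the oceanic crust" separately and never the combined "the top of the oceanic crust", so it only fits if short, non-nested phrases are what you mean by an entity.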
Regarding "python - Getting entities from an NLTK.tree result", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/26210567/