
python - Recursion in nltk's RegexpParser

Reposted. Author: 太空狗. Updated: 2023-10-30 02:51:28

Based on the grammar in chapter 7 of the NLTK Book:

grammar = r"""
NP: {<DT|JJ|NN.*>+} # ...
"""

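I want to extend NP (noun phrase) to include multiple NPs joined by CC (coordinating conjunction: and) or , (comma), so that it captures noun phrases such as: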

  • The house and tree
  • The apple, orange and mango
  • Car, house, and plane

I can't work out how to modify the grammar so that each of these is captured as a single NP:

import nltk

grammar = r"""
NP: {<DT|JJ|NN.*>+(<CC|,>+<NP>)?}
"""

sentence = 'The house and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))

Result:

(S (NP The/DT house/NN) and/CC (NP tree/NN))

I tried moving the NP to the beginning: NP: {(<NP><CC|,>+)?<DT|JJ|NN.*>+} but I get the same result:

(S (NP The/DT house/NN) and/CC (NP tree/NN))

Best answer

Let's start small and capture the NPs (noun phrases) properly:

import nltk

grammar = r"""
NP: {<DT|JJ|NN.*>+}
"""

sentence = 'The house and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))

[out]:

(S (NP The/DT house/NN) and/CC (NP tree/NN))

Now let's try to catch the and/CC. Simply add a higher-level phrase that reuses the <NP> rule:

grammar = r"""
NP: {<DT|JJ|NN.*>+}
CNP: {<NP><CC><NP>}
"""

sentence = 'The house and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))

[out]:

(S (CNP (NP The/DT house/NN) and/CC (NP tree/NN)))

Now that the NP CC NP phrase is captured, let's go one step further and see whether it also catches commas:

grammar = r"""
NP: {<DT|JJ|NN.*>+}
CNP: {<NP><CC|,><NP>}
"""

sentence = 'The house, the bear and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))

Now we see that it only catches the first, left-bounded NP CC|, NP and leaves the last NP outside the CNP.
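To make that limitation concrete, here is a minimal sketch that runs the same two-rule grammar on hand-tagged tokens. The hand-written tags are an assumption standing in for the output of nltk.pos_tag, used only so that the example does not depend on a tagger model; the variable name top_level is illustrative.

```python
import nltk

grammar = r"""
NP: {<DT|JJ|NN.*>+}
CNP: {<NP><CC|,><NP>}
"""

# Hand-tagged tokens (an assumption; they stand in for nltk.pos_tag output).
tagged = [('The', 'DT'), ('house', 'NN'), (',', ','),
          ('the', 'DT'), ('bear', 'NN'), ('and', 'CC'), ('tree', 'NN')]

result = nltk.RegexpParser(grammar).parse(tagged)
print(result)

# Summarise the top level: chunk labels for subtrees, raw (word, tag) pairs otherwise.
top_level = [child.label() if isinstance(child, nltk.Tree) else child
             for child in result]
print(top_level)
```

Only the first NP , NP pair ends up inside a CNP; and/CC and the final NP stay at the top level of the tree.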

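Conjoined phrases in English have a conjunction on the left and an NP on the right, i.e. CC|, NP, as in "and the tree". Since this CC|, NP pattern repeats, we can use it as an intermediate representation: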

grammar = r"""
NP: {<DT|JJ|NN.*>+}
XNP: {<CC|,><NP>}
CNP: {<NP><XNP>+}
"""

sentence = 'The house, the bear and tree'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))

[out]:

(S
  (CNP
    (NP The/DT house/NN)
    (XNP ,/, (NP the/DT bear/NN))
    (XNP and/CC (NP tree/NN))))
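One side effect of the XNP layer is that every conjunct after the first sits one level deeper in the tree. The sketch below shows one possible way to recover the individual conjunct NPs from each CNP using Tree.subtrees with a filter. The helper name conjuncts_of and the hand-tagged input are assumptions made for illustration, not part of the original answer.

```python
import nltk

grammar = r"""
NP: {<DT|JJ|NN.*>+}
XNP: {<CC|,><NP>}
CNP: {<NP><XNP>+}
"""

# Hand-tagged tokens (an assumption; they stand in for nltk.pos_tag output).
tagged = [('The', 'DT'), ('house', 'NN'), (',', ','),
          ('the', 'DT'), ('bear', 'NN'), ('and', 'CC'), ('tree', 'NN')]

def conjuncts_of(cnp):
    """Return the text of every NP nested anywhere inside one CNP subtree."""
    return [' '.join(token for token, tag in np.leaves())
            for np in cnp.subtrees(filter=lambda t: t.label() == 'NP')]

tree = nltk.RegexpParser(grammar).parse(tagged)

# One list of conjuncts per CNP found in the sentence.
conjunct_lists = [conjuncts_of(cnp)
                  for cnp in tree.subtrees(filter=lambda t: t.label() == 'CNP')]
print(conjunct_lists)
```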

In the end, the CNP (conjoined NPs) grammar captures chained noun-phrase conjunctions in English, even complicated ones, e.g.

import nltk

grammar = r"""
NP: {<DT|JJ|NN.*>+}
XNP: {<CC|,><NP>}
CNP: {<NP><XNP>+}
"""

sentence = 'The house, the bear, the green house and a tree went to the park or the river.'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
print(chunkParser.parse(tagged))

[out]:

(S
  (CNP
    (NP The/DT house/NN)
    (XNP ,/, (NP the/DT bear/NN))
    (XNP ,/, (NP the/DT green/JJ house/NN))
    (XNP and/CC (NP a/DT tree/JJ)))
  went/VBD
  to/TO
  (CNP (NP the/DT park/NN) (XNP or/CC (NP the/DT river/NN)))
  ./.)

If you just want to extract the noun phrases, you can adapt the approach from How to Traverse an NLTK Tree object?:

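noun_phrases = []

def traverse_tree(tree):
    # Record the text of this subtree if it is a conjoined noun phrase.
    if tree.label() == 'CNP':
        noun_phrases.append(' '.join([token for token, tag in tree.leaves()]))
    # Recurse into child subtrees; leaves are plain (token, tag) tuples.
    for subtree in tree:
        if isinstance(subtree, nltk.Tree):
            traverse_tree(subtree)

    return noun_phrases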

sentence = 'The house, the bear, the green house and a tree went to the park or the river.'
chunkParser = nltk.RegexpParser(grammar)
words = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(words)
traverse_tree(chunkParser.parse(tagged))

[out]:

['The house , the bear , the green house and a tree', 'the park or the river']
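The traversal above appends to a module-level list, so calling it on a second sentence keeps the results from the first. As a hedged alternative, the sketch below collects the same strings with Tree.subtrees and a filter, returning a fresh list per call. The function name extract_cnp and the hand-tagged tokens are assumptions for illustration; with a real tagger the tags may differ.

```python
import nltk

grammar = r"""
NP: {<DT|JJ|NN.*>+}
XNP: {<CC|,><NP>}
CNP: {<NP><XNP>+}
"""

def extract_cnp(tree):
    """Return the text of every CNP chunk in the tree, in document order."""
    return [' '.join(token for token, tag in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == 'CNP')]

# Hand-tagged tokens (an assumption; they stand in for nltk.pos_tag output).
tagged = [('The', 'DT'), ('house', 'NN'), (',', ','), ('the', 'DT'), ('bear', 'NN'),
          ('and', 'CC'), ('a', 'DT'), ('tree', 'NN'), ('went', 'VBD'), ('to', 'TO'),
          ('the', 'DT'), ('park', 'NN'), ('or', 'CC'), ('the', 'DT'), ('river', 'NN'),
          ('.', '.')]

phrases = extract_cnp(nltk.RegexpParser(grammar).parse(tagged))
print(phrases)
```

Because nothing is stored outside the function, it can be called on any number of parsed sentences without clearing state in between.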

See also Python (NLTK) - more efficient way to extract noun phrases?

Regarding python - Recursion in nltk's RegexpParser, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55766558/
