gpt4 book ai didi

python - 解析 penn 语法树以提取其语法规则

转载 作者:太空宇宙 更新时间:2023-11-03 13:10:30 25 4
gpt4 key购买 nike

我有一个 PENN 语法树,我想递归地获取这棵树包含的所有规则。

(ROOT 
(S
(NP (NN Carnac) (DT the) (NN Magnificent))
(VP (VBD gave) (NP ((DT a) (NN talk))))
)
)

我的目标是获得如下语法规则:

ROOT --> S
S --> NP VP
NP --> NN
...

正如我所说,我需要递归地执行此操作,并且不需要 NLTK 包或任何其他模块或正则表达式。这是我到目前为止所拥有的。参数 tree 是在每个空间上 split 的 Penn-Tree。

def extract_rules(tree):
tree = tree[1:-1]
print("\n\n")

if len(tree) == 0:
return

root_node = tree[0]
print("Current Root: "+root_node)

remaining_tree = tree[1:]
right_side = []

temp_tree = list(remaining_tree)
print("remaining_tree: ", remaining_tree)
symbol = remaining_tree.pop(0)

print("Symbol: "+symbol)

if symbol not in ["(", ")"]:
print("CASE: No Brackets")
print("Rule: "+root_node+" --> "+str(symbol))

right_side.append(symbol)

elif symbol == "(":
print("CASE: Opening Bracket")
print("Temp Tree: ", temp_tree)
cursubtree_end = bracket_depth(temp_tree)
print("Subtree ends at position "+str(cursubtree_end)+" and Element is "+temp_tree[cursubtree_end])
cursubtree_start = temp_tree.index(symbol)

cursubtree = temp_tree[cursubtree_start:cursubtree_end+1]
print("Subtree: ", cursubtree)

rnode = extract_rules(cursubtree)
if rnode:
right_side.append(rnode)
print("Rule: "+root_node+" --> "+str(rnode))

print(right_side)
return root_node


def bracket_depth(tree):
counter = 0
position = 0
subtree = []

for i, char in enumerate(tree):
if char == "(":
counter = counter + 1
if char == ")":
counter = counter - 1

if counter == 0 and i != 0:
counter = i
position = i
break

subtree = tree[0:position+1]

return position

目前它适用于 S 的第一个子树,但所有其他子树都不会递归解析。很乐意提供任何帮助..

最佳答案

我倾向于让它尽可能简单,而不是尝试重新发明您目前不允许使用的解析模块。像这样的东西:

string = '''
(ROOT
(S
(NP (NN Carnac) (DT the) (NN Magnificent))
(VP (VBD gave) (NP (DT a) (NN talk)))
)
)
'''

def is_symbol_char(character):
'''
Predicate to test if a character is valid
for use in a symbol, extend as needed.
'''

return character.isalpha() or character in '-=$!?.'

def tokenize(characters):
'''
Process characters into a nested structure. The original string
'(DT the)' is passed in as ['(', 'D', 'T', ' ', 't', 'h', 'e', ')']
'''

tokens = []

while characters:
character = characters.pop(0)

if character.isspace():
pass # nothing to do, ignore it

elif character == '(': # signals start of recursive analysis (push)
characters, result = tokenize(characters)
tokens.append(result)

elif character == ')': # signals end of recursive analysis (pop)
break

elif is_symbol_char(character):
# if it looks like a symbol, collect all
# subsequents symbol characters
symbol = ''

while is_symbol_char(character):
symbol += character
character = characters.pop(0)

# push unused non-symbol character back onto characters
characters.insert(0, character)

tokens.append(symbol)

# Return whatever tokens we collected and any characters left over
return characters, tokens

def extract_rules(tokens):
''' Recursively walk tokenized data extracting rules. '''

head, *tail = tokens

print(head, '-->', *[x[0] if isinstance(x, list) else x for x in tail])

for token in tail: # recurse
if isinstance(token, list):
extract_rules(token)

characters, tokens = tokenize(list(string))

# After a successful tokenization, all the characters should be consumed
assert not characters, "Didn't consume all the input!"

print('Tokens:', tokens[0], 'Rules:', sep='\n\n', end='\n\n')

extract_rules(tokens[0])

输出

Tokens:

['ROOT', ['S', ['NP', ['NN', 'Carnac'], ['DT', 'the'], ['NN', 'Magnificent']], ['VP', ['VBD', 'gave'], ['NP', ['DT', 'a'], ['NN', 'talk']]]]]

Rules:

ROOT --> S
S --> NP VP
NP --> NN DT NN
NN --> Carnac
DT --> the
NN --> Magnificent
VP --> VBD NP
VBD --> gave
NP --> DT NN
DT --> a
NN --> talk

注意

我把你原来的树改成了这个子句:

(NP ((DT a) (NN talk)))

似乎不正确,因为它在网络上可用的语法 TreeMap 示器上生成了一个空节点,所以我将其简化为:

(NP (DT a) (NN talk))

根据需要进行调整。

关于python - 解析 penn 语法树以提取其语法规则,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/44708743/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com