
python - Extract all variables from a Python code string (regex or AST)

Reposted. Author: 行者123. Updated: 2023-12-01 07:10:36

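I want to find and extract all variables in a string that contains Python code. I only want variables (and subscripted variables), not function calls.

For example, from the following string: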

code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'

I want to extract: foo, bar[1], baz[1:10:var1[2+1]], var1[2+1], qux[[1,2,int(var2)]], var2, bob[len("foobar")], var3[0]. Note that some variables may be "nested". For example, from baz[1:10:var1[2+1]] I want to extract both baz[1:10:var1[2+1]] and var1[2+1].

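The first two ideas that came to mind were regex and AST. I tried both without success.

With regex, to keep things simple, I thought it would be a good idea to extract the "top-level" variables first and then recurse into the nested ones. Unfortunately, I could not even get that far.

Here is what I have so far: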

import re

regex = r'[_a-zA-Z]\w*\s*(\[.*\])?'
for match in re.finditer(regex, code):
    print(match)

Here is a demo: https://regex101.com/r/INPRdN/2
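To see where this pattern goes wrong, it helps to print what it actually matches. The sketch below reuses the code string and regex from the question; the matches list is only an illustrative name. Because .* is greedy, the optional bracket group runs from the first [ after bar to the last ] in the whole string.

```python
import re

code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
regex = r'[_a-zA-Z]\w*\s*(\[.*\])?'

# Collect the full text of every match (illustrative helper list).
matches = [m.group(0) for m in re.finditer(regex, code)]

for text in matches:
    # repr() makes trailing whitespace in a match visible.
    print(repr(text))
```

Only two matches come back: 'foo ' (including the trailing space) and one long span that starts at bar[1] and ends at var3[0], so none of the later variables are reported on their own.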

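Another option is to use AST: subclass ast.NodeVisitor and implement the visit_Name and visit_Subscript methods. However, that does not work either, because visit_Name is also called for the names of called functions.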
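One possible way around the visit_Name problem is to also override visit_Call, so that the called name is skipped while the arguments are still visited. The following is a minimal sketch, not part of the original post. It assumes Python 3.8 or later (for ast.get_source_segment) and input that parses as a single valid expression; VariableCollector and extract_variables are hypothetical names chosen for illustration.

```python
import ast


class VariableCollector(ast.NodeVisitor):
    """Collect bare names and subscript expressions, skipping called names."""

    def __init__(self, source):
        self.source = source
        self.found = []

    def visit_Name(self, node):
        self.found.append(node.id)

    def visit_Subscript(self, node):
        # Record the whole subscript exactly as written in the source.
        self.found.append(ast.get_source_segment(self.source, node))
        # A plain subscripted name is already part of that text, so only
        # descend into the value when it is something more complex.
        if not isinstance(node.value, ast.Name):
            self.visit(node.value)
        self.visit(node.slice)

    def visit_Call(self, node):
        # Skip a plain function name such as func in func(...),
        # but still look inside the arguments.
        if not isinstance(node.func, ast.Name):
            self.visit(node.func)
        for arg in node.args:
            self.visit(arg)
        for keyword in node.keywords:
            self.visit(keyword.value)


def extract_variables(source):
    """Return names and subscripts in source order (hypothetical helper)."""
    collector = VariableCollector(source)
    collector.visit(ast.parse(source, mode='eval'))
    return collector.found


code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
print(extract_variables(code))
```

On the example string this yields the eight items listed in the question, in the same order. Since it relies on ast.parse, it raises SyntaxError for text that is not valid Python.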

I would be grateful if someone could provide a solution to this problem (regex or AST).

Thanks.

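Best answer

I found your question an interesting challenge, so here is code that does what you want. Regex alone cannot do this because of the nested expressions, so this solution combines regex with string manipulation to handle them: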

# -*- coding: utf-8 -*-
import re

# An identifier that is not followed by '[', '(' or a quote.
# Note: as written, it only matches names that start with a lowercase letter.
RE_IDENTIFIER = r'\b[a-z]\w*\b(?!\s*[\[\("\'])'
RE_INDEX_ONLY = re.compile(r'(##)(\d+)(##)')
RE_INDEX = re.compile(r'##\d+##')


def extract_expression(string):
    """Extract all identifier and getitem expressions in the given order."""

    def remove_brackets(text):
        # 1. Replace bare `[...]` expressions (lists) with #{#...#}#
        #    so we don't confuse them with word[...].
        pattern = r'(?<!\w)(\s*)(\[)([^\[]+?)(\])'
        # Keep replacing until no bare bracket expression is left.
        while re.search(pattern, text):
            # Substitute on `text`, not the outer `string`; otherwise the
            # loop can never make progress past the first pass.
            text = re.sub(pattern, r'\1#{#\3#}#', text)
        return text

    def get_ordered_subexp(exp):
        """Return the key of exp followed by the keys nested inside it."""
        index = int(exp.replace('#', ''))
        subexp = RE_INDEX.findall(expressions[index])
        if not subexp:
            return exp
        return exp + ''.join(get_ordered_subexp(i) for i in subexp)

    def replace_expression(match):
        """Save the expression in the list and replace it with a special key built from its index."""
        match_exp = match.group(0)
        current_index = len(expressions)
        # Reserve the slot first so the expression comes before its inner identifiers.
        expressions.append(None)
        # If the expression contains identifiers, extract them too.
        if re.search(RE_IDENTIFIER, match_exp) and '[' in match_exp:
            match_exp = re.sub(RE_IDENTIFIER, replace_expression, match_exp)
        expressions[current_index] = match_exp
        return '##{}##'.format(current_index)

    def fix_expression(match):
        """Replace a special key with the expression stored at that index."""
        return expressions[int(match.group(2))]

    # List that will hold the extracted expressions.
    expressions = []

    string = remove_brackets(string)

    # 2. Extract all expressions and keep track of their place in the original code.
    pattern = r'\w+\s*\[[^\[]+?\]|{}'.format(RE_IDENTIFIER)
    # Keep extracting until no expression is left.
    while re.search(pattern, string):
        # Every expression that is extracted is replaced by a special key.
        string = re.sub(pattern, replace_expression, string)
        # Brackets can themselves contain getitem expressions, so after
        # extracting those we handle the remaining brackets again.
        string = remove_brackets(string)

    # 3. Build the correct result from the extracted expressions.
    result = [None] * len(expressions)
    for index, exp in enumerate(expressions):
        # Keep replacing special keys with the corresponding expression.
        while RE_INDEX_ONLY.search(exp):
            exp = RE_INDEX_ONLY.sub(fix_expression, exp)
        # Finally, restore the brackets.
        result[index] = exp.replace('#{#', '[').replace('#}#', ']')

    # 4. Order the indexes that were extracted.
    ordered_index = ''.join(get_ordered_subexp(exp) for exp in RE_INDEX.findall(string))
    # Convert them to integers.
    ordered_index = [int(index[1]) for index in RE_INDEX_ONLY.findall(ordered_index)]

    # 5. Put the expressions in the right order using the ordered indexes.
    final_result = []
    for exp_index in ordered_index:
        final_result.append(result[exp_index])

    # For debugging:
    # print('final string:', string)
    # print('expression :', expressions)
    # print('order_of_expresion: ', ordered_index)
    return final_result


code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
code2 = 'baz[1:10:var1[2+1]]'
code3 = 'baz[[1]:10:var1[2+1]:[var3[3+1*x]]]'
print(extract_expression(code))
print(extract_expression(code2))
print(extract_expression(code3))

Output:

['foo', 'bar[1]', 'baz[1:10:var1[2+1]]', 'var1[2+1]', 'qux[[1,2,int(var2)]]', 'var2', 'bob[len("foobar")]', 'var3[0]']
['baz[1:10:var1[2+1]]', 'var1[2+1]']
['baz[[1]:10:var1[2+1]:[var3[3+1*x]]]', 'var1[2+1]', 'var3[3+1*x]', 'x']

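I tested this code against some very complex examples and it worked well. Note that the expressions are extracted in the order you asked for. I hope this is what you need.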
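The core trick in the answer is to swap each matched expression for a numbered ##n## key, repeat until nothing is left to match, and expand the keys again afterwards. Below is a stripped-down sketch of just that idea, not the answer's exact logic. The names collapse_brackets and restore are hypothetical, and the sketch assumes the input contains no string literals with brackets and no text that already looks like a ##n## key.

```python
import re

# An innermost bracket group: [...] with no further brackets inside.
INNERMOST = re.compile(r'\[[^\[\]]*\]')
KEY = re.compile(r'##(\d+)##')


def collapse_brackets(text):
    """Replace innermost [...] groups with ##n## keys until none remain."""
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return '##{}##'.format(len(saved) - 1)

    while INNERMOST.search(text):
        text = INNERMOST.sub(stash, text)
    return text, saved


def restore(text, saved):
    """Expand ##n## keys back into the text they stand for."""
    while KEY.search(text):
        text = KEY.sub(lambda m: saved[int(m.group(1))], text)
    return text


collapsed, saved = collapse_brackets('baz[1:10:var1[2+1]]')
print(collapsed)                  # baz##1##
print(saved)                      # ['[2+1]', '[1:10:var1##0##]']
print(restore(collapsed, saved))  # baz[1:10:var1[2+1]]
```

Working from the innermost group outwards is what lets a plain regex cope with nesting: each pass only has to recognise a flat pattern, and the saved list records how the pieces fit back together.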

Regarding "python - Extract all variables from a Python code string (regex or AST)", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58237331/
