
python - Tips on how to parse a custom file format

Reposted. Author: 太空宇宙. Updated: 2023-11-03 12:15:41

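Sorry for the vague title, but I really don't know how to describe this problem concisely.

I've created a (more or less) simple domain-specific language that I will use to specify validation rules that apply to different entities (generally forms submitted from a web page). I've included a sample of the language at the bottom of this post.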

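My problem is that I have no idea how to start parsing this language into a form I can use (I will be using Python to do the parsing). My goal is to end up with a list of rules/filters (as strings) that should be applied, in order, to each object/entity (also given as strings, e.g. 'chocolate', 'chocolate.lindt', etc.).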

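I'm not sure which technique to start with, or even which techniques exist for problems like this. What do you think is the best way to go about it? I'm not looking for a complete solution, just a general nudge in the right direction.

Thanks.

Sample file of the language: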

# Comments start with the '#' character and last until the end of the line
# Indentation is significant (as in Python)


constant NINETY_NINE = 99 # Defines the constant `NINETY_NINE` to have the value `99`


*:      # Applies to all data
    isYummy             # Everything must be yummy

chocolate:              # To validate, say `validate("chocolate", object)`
    sweet               # chocolate must be sweet (but not necessarily chocolate.*)

    lindt:              # To validate, say `validate("chocolate.lindt", object)`
        tasty           # Applies only to chocolate.lindt (and not to chocolate.lindt.dark, for e.g.)

        *:              # Applies to all data under chocolate.lindt
            smooth      # Could also be written smooth()
            creamy(1)   # Level 1 creamy
        dark:           # dark has no special validation rules
            extraDark:
                melt            # Filter that modifies the object being examined
                c:bitter        # Must be bitter, but only validated on client
                s:cocoa(NINETY_NINE)    # Must contain 99% cocoa, but only validated on server. Note constant
        milk:
            creamy(2)   # Level 2 creamy, overrides creamy(1) of chocolate.lindt.* for chocolate.lindt.milk
            creamy(3)   # Overrides creamy(2) of previous line (all but the last specification of a given rule are ignored)



ruleset food:       # To define a chunk of validation rules that can be expanded from the placeholder `food` (think macro)
    caloriesWithin(10, 2000)        # Unlimited parameters allowed
    edible
    leftovers:      # Nested rules allowed in rulesets
        stale

# Rulesets may be nested and/or include other rulesets in their definition



chocolate:              # Previously defined groups can be re-opened and expanded later
    ferrero:
        hasHazelnut



cake:
    tasty               # Same rule used for different data (see chocolate.lindt)
    isLie
    ruleset food        # Substitutes with rules defined for food; cake.leftovers must now be stale


pasta:
    ruleset food        # pasta.leftovers must also be stale




# Sample use (in JavaScript):

# var choc = {
#   lindt: {
#       cocoa: {
#           percent: 67,
#           mass:    '27g'
#       }
#   }
#   // Objects/groups that are omitted (e.g. ferrero in this example) are not validated and raise no errors
#   // Objects that are not defined in the validation rules do not raise any errors (e.g. cocoa in this example)
# };
# validate('chocolate', choc);

# `validate` called isYummy(choc), sweet(choc), isYummy(choc.lindt), smooth(choc.lindt), creamy(choc.lindt, 1), and tasty(choc.lindt) in that order
# `validate` returned an array of any validation errors that were found

# Order of rule validation for objects:
# The current object is initially the object passed in to the validation function (second argument).
# The entry point in the rule group hierarchy is given by the first argument to the validation function.
# 1. First all rules that apply to all objects (defined using '*') are applied to the current object,
# starting with the most global rules and ending with the most local ones.
# 2. Then all specific rules for the current object are applied.
# 3. Then a depth-first traversal of the current object is done, repeating steps 1 and 2 with each object found as the current object
# When two rules have equal priority, they are applied in the order they were defined in the file.



# No need to end on blank line

Best answer

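First off, if you want to learn about parsing, write your own recursive descent parser. The language you've defined only needs a handful of productions. I suggest using Python's tokenize library to spare yourself the tedious task of converting a stream of bytes into a stream of tokens.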
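As a rough illustration of that suggestion, the sketch below runs a small recursive descent parser over the tokens from the standard `tokenize` module, which already emits INDENT and DEDENT tokens for significant indentation. It is a minimal sketch under assumptions, not part of the original answer: the `Parser` class, the `parse` and `new_node` helpers, and the `{"rules": [...], "groups": {...}}` tree layout are all hypothetical, and only groups and plain rules are covered (the `constant`, `ruleset` and `c:`/`s:` forms are left out).

```python
import io
import tokenize

# Token types that carry no structure for this sketch.
SKIP = {tokenize.COMMENT, tokenize.NL}


def new_node():
    """Hypothetical tree node: rules of a group plus its named sub-groups."""
    return {"rules": [], "groups": {}}


class Parser:
    """Recursive descent over tokens produced by the tokenize module."""

    def __init__(self, source):
        readline = io.StringIO(source).readline
        self.toks = [t for t in tokenize.generate_tokens(readline)
                     if t.type not in SKIP]
        self.pos = 0

    def peek(self):
        return self.toks[self.pos]

    def advance(self, expected_type=None):
        tok = self.toks[self.pos]
        if expected_type is not None and tok.type != expected_type:
            raise SyntaxError(f"unexpected {tok.string!r} on line {tok.start[0]}")
        self.pos += 1
        return tok

    def parse_block(self, node, end_type):
        # block := statement*
        while self.peek().type != end_type:
            self.parse_statement(node)
        return node

    def parse_statement(self, node):
        head = self.advance()
        if head.type != tokenize.NAME and head.string != "*":
            raise SyntaxError(f"unexpected {head.string!r} on line {head.start[0]}")
        if self.peek().string == ":":
            # group := ('*' | NAME) ':' NEWLINE INDENT block DEDENT
            self.advance()
            self.advance(tokenize.NEWLINE)
            self.advance(tokenize.INDENT)
            # setdefault lets a group be re-opened later in the file.
            child = node["groups"].setdefault(head.string, new_node())
            self.parse_block(child, tokenize.DEDENT)
            self.advance(tokenize.DEDENT)
        else:
            # rule := NAME [ '(' args ')' ] NEWLINE, kept as a string
            text = head.string
            while self.peek().type != tokenize.NEWLINE:
                text += self.advance().string
            self.advance(tokenize.NEWLINE)
            node["rules"].append(text)


def parse(source):
    return Parser(source).parse_block(new_node(), tokenize.ENDMARKER)


SAMPLE = """\
*:
    isYummy             # Everything must be yummy
chocolate:
    sweet
    lindt:
        tasty
        creamy(1)
"""

tree = parse(SAMPLE)
print(tree["groups"]["chocolate"]["groups"]["lindt"]["rules"])
```

Each grammar production maps to one method, which is the main appeal of recursive descent for a language this small. Rules are kept as strings here because that is the form the question asks for.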

For practical parsing options, read on...

A quick and dirty solution is to use Python itself:

NINETY_NINE = 99       # Defines the constant `NINETY_NINE` to have the value `99`

rules = {
    '*': {                    # Applies to all data
        'isYummy': {},        # Everything must be yummy

        'chocolate': {        # To validate, say `validate("chocolate", object)`
            'sweet': {},      # chocolate must be sweet (but not necessarily chocolate.*)

            'lindt': {        # To validate, say `validate("chocolate.lindt", object)`
                'tasty': {},  # Applies only to chocolate.lindt (and not to chocolate.lindt.dark, for e.g.)

                '*': {        # Applies to all data under chocolate.lindt
                    'smooth': {},  # Could also be written smooth()
                    'creamy': 1,   # Level 1 creamy
                },
                # ...
            },
        },
    },
}
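Once the rules live in a nested dictionary, looking up the ordered rules for an entity takes only a few lines. The following is a sketch under assumptions rather than part of the original answer: the `RULES` layout (groups are dicts, rules map to their argument or `None`) and the `rules_for` helper are hypothetical, and the ordering follows the question's description ('*' rules from most global to most local, then the entity's own rules).

```python
# Hypothetical layout: a dict value is a nested group,
# anything else is a rule mapped to its argument (None if it takes none).
RULES = {
    "*": {"isYummy": None},
    "chocolate": {
        "sweet": None,
        "lindt": {
            "tasty": None,
            "*": {"smooth": None, "creamy": 1},
        },
    },
}


def rules_for(path, rules=RULES):
    """Return ordered (rule, argument) pairs for a dotted path."""
    group = rules
    # Start with the most global '*' rules.
    ordered = dict(group.get("*", {}))
    for name in path.split("."):
        group = group[name]  # raises KeyError for an undefined path
        # '*' rules of each enclosing group, from global to local.
        ordered.update(group.get("*", {}))
    # Finally the rules defined directly on the target group.
    ordered.update({k: v for k, v in group.items() if not isinstance(v, dict)})
    return list(ordered.items())


print(rules_for("chocolate"))
print(rules_for("chocolate.lindt"))
```

This relies on dictionaries preserving insertion order (guaranteed since Python 3.7). A later definition of the same rule replaces the earlier argument but keeps the earlier position, which is one possible reading of the override behaviour described in the question.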

There are several ways to pull off this trick. For example, here is a cleaner (albeit somewhat unusual) approach using classes:

class _:
    class isYummy: pass

    class chocolate:
        class sweet: pass

        class lindt:
            class tasty: pass

            class _:
                class smooth: pass
                class creamy: level = 1
# ...

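As an intermediate step towards a full parser, you can use the "batteries included" Python parser, which parses Python syntax and returns an AST. The AST is very deep, with lots of (IMO) unnecessary levels. You can filter these down to a much simpler structure by culling any node that has only one child. With this approach you can do something like this: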

# Note: the `parser` and `symbol` modules were removed in Python 3.10,
# so this example needs Python 3.9 or earlier.
import parser, token, symbol, pprint

# Map numeric token and grammar-symbol codes to their names.
_map = dict(token.tok_name)
_map.update(symbol.sym_name)

def clean_ast(ast):
    if not isinstance(ast, list):
        return ast
    elif len(ast) == 2:  # Elide single-child nodes.
        return clean_ast(ast[1])
    else:
        return [_map[ast[0]]] + [clean_ast(a) for a in ast[1:]]

ast = parser.expr('''{

'*': {     # Applies to all data
  isYummy: _,    # Everything must be yummy

  chocolate: {        # To validate, say `validate("chocolate", object)`
    sweet: _,        # chocolate must be sweet (but not necessarily chocolate.*)

    lindt: {          # To validate, say `validate("chocolate.lindt", object)`
      tasty: _,        # Applies only to chocolate.lindt (and not to chocolate.lindt.dark, for e.g.)

      '*': {            # Applies to all data under chocolate.lindt
        smooth: _,  # Could also be written smooth()
        creamy: 1   # Level 1 creamy
      }
      # ...
    }
  }
}

}''').tolist()
pprint.pprint(clean_ast(ast))

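This approach does have its limitations. The final AST is still a bit noisy, and the language you define has to be interpretable as valid Python code. For example, you couldn't support this...

*:
    isYummy

...because this syntax doesn't parse as Python code. Its big advantage, however, is that you control the AST conversion, so it is impossible to inject arbitrary Python code.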
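The point about controlling the conversion can be illustrated with the standard `ast` module, which is the parsing interface available on current Python versions. This is a sketch under assumptions and swaps in `ast` for the module used in the answer: the `convert` helper and the `SOURCE` string are hypothetical, and only dict literals, bare names and constants are accepted.

```python
import ast

SOURCE = """{
    '*': {
        isYummy: _,
        chocolate: {
            sweet: _,
            lindt: {
                tasty: _,
                '*': {smooth: _, creamy: 1},
            },
        },
    },
}"""


def convert(node):
    """Turn a whitelisted subset of Python syntax into plain data."""
    if isinstance(node, ast.Dict):
        return {convert(k): convert(v) for k, v in zip(node.keys, node.values)}
    if isinstance(node, ast.Name):
        return node.id      # bare names become strings; nothing is evaluated
    if isinstance(node, ast.Constant):
        return node.value   # numbers and string literals
    # Calls, attribute access, comprehensions, etc. are all rejected.
    raise ValueError(f"unsupported syntax: {type(node).__name__}")


tree = ast.parse(SOURCE, mode="eval")  # parse only, never execute
rules = convert(tree.body)
print(rules)
```

Because the source is only parsed and every node type outside the whitelist raises `ValueError`, an input such as `__import__('os')` is rejected instead of being run.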

Regarding "python - Tips on how to parse a custom file format", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/2036236/
