
python - Advanced Python regex: how to evaluate and extract nested lists and numbers from a multiline string?

Reposted. Author: 太空狗. Updated: 2023-10-29 18:27:08

I am trying to separate the elements of a multiline string:

lines = '''c0 c1 c2 c3 c4 c5
0 10 100.5 [1.5, 2] [[10, 10.4], [c, 10, eee]] [[a , bg], [5.5, ddd, edd]] 100.5
1 20 200.5 [2.5, 2] [[20, 20.4], [d, 20, eee]] [[a , bg], [7.5, udd, edd]] 200.5'''

My goal is to get a list lst such that:

# first value is index
lst[0] = ['c0', 'c1', 'c2', 'c3', 'c4','c5']
lst[1] = [0, 10, 100.5, [1.5, 2], [[10, 10.4], ['c', 10, 'eee']], [['a' , 'bg'], [5.5, 'ddd', 'edd']], 100.5 ]
lst[2] = [1, 20, 200.5, [2.5, 2], [[20, 20.4], ['d', 20, 'eee']], [['a' , 'bg'], [7.5, 'udd', 'edd']], 200.5 ]

My attempt so far is this:

import re

lines = '''c0 c1 c2 c3 c4 c5
0 10 100.5 [1.5, 2] [[10, 10.4], [c, 10, eee]] [[a , bg], [5.5, ddd, edd]] 100.5
1 20 200.5 [2.5, 2] [[20, 20.4], [d, 20, eee]] [[a , bg], [7.5, udd, edd]] 200.5'''


# get n elements for n lines and remove empty lines
lines = lines.split('\n')
lines = list(filter(None,lines))

lst = []
lst.append(lines[0].split())


for i in range(1,len(lines)):
    change = re.sub('([a-zA-Z]+)', r"'\1'", lines[i])
    lst.append(change)

for i in lst[1]:
    print(i)

How can I fix the regex?

Update
Test datasets

data = """
orig shifted not_equal cumsum lst
0 10 NaN True 1 [[10, 10.4], [c, 10, eee]]
1 10 10.0 False 1 [[10, 10.4], [c, 10, eee]]
2 23 10.0 True 2 [[10, 10.4], [c, 10, eee]]
"""

# Gives: ValueError: malformed node or string:

data = """
Name Result Value
0 Name1 5 2
1 Name1 5 3
2 Name2 11 1
"""
# gives same error


data = """
product value
0 A 25
1 B 45
2 C 15
3 C 14
4 C 13
5 B 22
"""
# gives same error

data = '''
c0 c1
0 10 100.5
1 20 200.5
'''
# works perfect

Best answer

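As noted in the comments, this task cannot be done with regular expressions. Classic regular expressions simply cannot handle arbitrarily nested structures. What you need is a parser.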
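To illustrate why a parser copes where a regex does not, here is a minimal hand-written sketch in Python. It is not the approach this answer goes on to use, only a demonstration that recursion is what follows the nesting. The names TOKEN, to_value, parse_items and parse_line are made up for this illustration, and the sketch is deliberately lenient: commas are simply skipped, and anything float() accepts (including NaN) becomes a number.

```python
import re

# One token is a bracket, a comma, or a run of characters that is neither
# whitespace, a bracket, nor a comma.
TOKEN = re.compile(r"\[|\]|,|[^\s\[\],]+")


def to_value(token):
    """Turn a bare token into an int or a float, or leave it as a string."""
    for convert in (int, float):
        try:
            return convert(token)
        except ValueError:
            pass
    return token


def parse_items(tokens, pos, inside_group):
    """Parse items starting at tokens[pos]; return (items, next position)."""
    items = []
    while pos < len(tokens):
        token = tokens[pos]
        if token == "[":
            # The recursive call is what lets the parser follow any nesting depth.
            group, pos = parse_items(tokens, pos + 1, True)
            items.append(group)
        elif token == "]":
            if not inside_group:
                raise ValueError("unexpected ']'")
            return items, pos + 1
        elif token == ",":
            pos += 1  # commas only separate items, so they are skipped
        else:
            items.append(to_value(token))
            pos += 1
    if inside_group:
        raise ValueError("missing ']'")
    return items, pos


def parse_line(line):
    """Parse one whitespace-delimited line into a nested Python list."""
    items, _ = parse_items(TOKEN.findall(line), 0, False)
    return items


row = parse_line("0 10 100.5 [1.5, 2] [[10, 10.4], [c, 10, eee]]")
print(row)
```

A grammar-based parser, as described next, states the same rules declaratively instead of in hand-written control flow.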

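One way to create a parser is PEG (parsing expression grammar), which lets you set up a list of tokens and their relationships to each other in a declarative language. This parser definition is then converted into an actual parser that can handle the described input. When parsing succeeds, you get back a tree structure with all items properly nested.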

For demonstration purposes I used the JavaScript implementation peg.js, which has an online demo page where you can live-test a parser against some input. This parser definition:

{
  // [value, [[delimiter, value], ...]] => [value, value, ...]
  const list = values => [values[0]].concat(values[1].map(i => i[1]));
}
document
  = line*
line "line"
  = value:(item (whitespace item)*) whitespace? eol { return list(value) }
item "item"
  = number / string / group
group "group"
  = "[" value:(item (comma item)*) whitespace? "]" { return list(value) }
comma "comma"
  = whitespace? "," whitespace?
number "number"
  = value:$[0-9.]+ { return +value }
string "string"
  = $([^ 0-9\[\]\r\n,] [^ \[\]\r\n,]*)
whitespace "whitespace"
  = $" "+
eol "eol"
  = [\r]? [\n] / eof
eof "eof"
  = !.

can understand input like this:

c0 c1 c2 c3 c4 c5
0   10 100.5 [1.5, 2]     [[10, 10.4], [c, 10, eee]]  [[a , bg], [5.5, ddd, edd]]
1   20 200.5 [2.5, 2]     [[20, 20.4], [d, 20, eee]]  [[a , bg], [7.5, udd, edd1]]

and produces this object tree (JSON notation):

[
["c0", "c1", "c2", "c3", "c4", "c5"],
[0, 10, 100.5, [1.5, 2], [[10, 10.4], ["c", 10, "eee"]], [["a", "bg"], [5.5, "ddd", "edd"]]],
[1, 20, 200.5, [2.5, 2], [[20, 20.4], ["d", 20, "eee"]], [["a", "bg"], [7.5, "udd", "edd1"]]]
]

  • a list of lines,
  • each of which is an array of values,
  • each of which can be a number, a string, or another array of values

Your program can then work with this tree structure.
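As a rough sketch of what working with such a tree might look like in Python, the generator below walks a nested list recursively and reports every leaf together with its nesting depth. The function name walk and the sample row are illustrative only; the row is a shortened copy of one line from the object tree above.

```python
def walk(value, depth=0):
    """Yield (depth, leaf) pairs for every non-list value in a nested list."""
    if isinstance(value, list):
        for child in value:
            yield from walk(child, depth + 1)
    else:
        yield depth, value


# One parsed row, shortened from the object tree above.
row = [0, 10, 100.5, [1.5, 2], [[10, 10.4], ["c", 10, "eee"]]]

# Collect all string leaves and find the deepest nesting level.
strings = [leaf for _, leaf in walk(row) if isinstance(leaf, str)]
max_depth = max(depth for depth, _ in walk(row))
print(strings, max_depth)
```

Because the parser has already resolved the nesting, the consuming code only needs a plain recursive traversal and no further text processing.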

The example above can be used with node.js to convert your input to JSON. The following minimal JS program accepts data from STDIN and writes the parsed result to STDOUT:

// reference the parser.js file, e.g. downloaded from https://pegjs.org/online
const parser = require('./parser');

var chunks = [];

// handle STDIN events to slurp up all the input into one big string
process.stdin.on('data', buffer => chunks.push(buffer.toString()));
process.stdin.on('end', function () {
    var text = chunks.join('');
    var data = parser.parse(text);
    var json = JSON.stringify(data, null, 4);
    process.stdout.write(json);
});

// start reading from STDIN
process.stdin.resume();

Save it as text2json.js or something similar and redirect (or pipe) some text into it:

# input redirection (this works on Windows, too)
node text2json.js < input.txt > output.json

# common alternative, but I'd recommend input redirection over this
cat input.txt | node text2json.js > output.json
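Since the question is about Python, the resulting JSON could then be consumed with the standard json module. The sketch below uses an inline string as a stand-in for the contents of output.json (an assumption, to keep the example self-contained); with a real file you would use json.load() on the opened file instead.

```python
import json

# Stand-in for the contents of output.json produced by the command above.
json_text = """
[
    ["c0", "c1", "c2", "c3", "c4", "c5"],
    [0, 10, 100.5, [1.5, 2], [[10, 10.4], ["c", 10, "eee"]], [["a", "bg"], [5.5, "ddd", "edd"]]]
]
"""

# The first line is the header; the remaining lines are data rows.
header, *rows = json.loads(json_text)
print(header)
print(rows[0][3])
```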

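There are PEG parser generators for Python as well, e.g. https://github.com/erikrose/parsimonious. The parser definition language differs between implementations, so the definition above can only be used with peg.js, but the principle is exactly the same.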


EDIT: I dug into Parsimonious and recreated the solution above in Python code. The approach is the same and the parser grammar is the same, with a few minor syntactic changes.

from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor

grammar = Grammar(
r"""
document = line*
line = whitespace? item (whitespace item)* whitespace? eol
item = group / number / boolean / string
group = "[" item (comma item)* whitespace? "]"
comma = whitespace? "," whitespace?
number = "NaN" / ~"[0-9.]+"
boolean = "True" / "False"
string = ~"[^ 0-9\[\]\r\n,][^ \[\]\r\n,]*"
whitespace = ~" +"
eol = ~"\r?\n" / eof
eof = ~"$"
""")

class DataExtractor(NodeVisitor):
    @staticmethod
    def concat_items(first_item, remaining_items):
        """ helper to concat the values of delimited items (lines or groups) """
        return first_item + list(map(lambda i: i[1][0], remaining_items))

    def generic_visit(self, node, processed_children):
        """ in general we just want to see the processed children of any node """
        return processed_children

    def visit_line(self, node, processed_children):
        """ line nodes return an array of their processed_children """
        _, first_item, remaining_items, _, _ = processed_children
        return self.concat_items(first_item, remaining_items)

    def visit_group(self, node, processed_children):
        """ group nodes return an array of their processed_children """
        _, first_item, remaining_items, _, _ = processed_children
        return self.concat_items(first_item, remaining_items)

    def visit_number(self, node, processed_children):
        """ number nodes return floats (nan is a special value of floats) """
        return float(node.text)

    def visit_boolean(self, node, processed_children):
        """ boolean nodes return True or False """
        return node.text == "True"

    def visit_string(self, node, processed_children):
        """ string nodes just return their own text """
        return node.text

DataExtractor is responsible for traversing the tree and extracting the data from the nodes, returning lists of strings, numbers, booleans, or NaN.

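The concat_items() function performs the same task as the list() function in the JavaScript code above. The other functions also have their equivalents in the peg.js approach, except that peg.js integrates them directly into the parser definition while Parsimonious expects them in a separate class, so it is a bit more verbose by comparison, but not too bad.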
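To make the indexing in concat_items() easier to follow, here is a standalone sketch that feeds it hand-written data. The shapes are an assumption inferred from the visitor code above: an item arrives wrapped in a one-element list (because generic_visit returns the list of processed children), and each repeated (delimiter item) pair arrives as a two-element list whose first slot is ignored. The None placeholders stand in for whatever the delimiter node produced.

```python
def concat_items(first_item, remaining_items):
    """Same logic as DataExtractor.concat_items, written as a plain function."""
    return first_item + [pair[1][0] for pair in remaining_items]


# Hand-written stand-ins for the processed children of the text "[c, 10, eee]".
first_item = ["c"]           # the first item, wrapped in a one-element list
remaining_items = [
    [None, [10.0]],          # [delimiter placeholder, wrapped item]
    [None, ["eee"]],
]

print(concat_items(first_item, remaining_items))
```

The expression pair[1][0] therefore reads as: take the item slot of the pair, then unwrap its single value.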

Usage, assuming an input file named "data.txt", also mirrors the JS code:

de = DataExtractor()

with open("data.txt", encoding="utf8") as f:
    text = f.read()

tree = grammar.parse(text)
data = de.visit(tree)
print(data)

Input:

orig shifted not_equal cumsum lst
0 10 NaN True 1 [[10, 10.4], [c, 10, eee]]
1 10 10.0 False 1 [[10, 10.4], [c, 10, eee]]
2 23 10.0 True 2 [[10, 10.4], [c, 10, eee]]

Output:

[
    ['orig', 'shifted', 'not_equal', 'cumsum', 'lst'],
    [0.0, 10.0, nan, True, 1.0, [[10.0, 10.4], ['c', 10.0, 'eee']]],
    [1.0, 10.0, 10.0, False, 1.0, [[10.0, 10.4], ['c', 10.0, 'eee']]],
    [2.0, 23.0, 10.0, True, 2.0, [[10.0, 10.4], ['c', 10.0, 'eee']]]
]

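In the long run, I would expect this approach to be more maintainable and flexible than regex hackery. Adding explicit support for NaN and boolean values, for example, was easy (the peg.js solution above does not have it; there they are parsed as strings).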
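One difference from the list requested in the question is that every number in the output above is a float (10.0 rather than 10). A possible post-processing step, sketched below under the assumption that whole-valued floats should become ints, walks the result recursively. The name narrow_numbers is made up for this illustration, and note the trade-off: an input written as 10.0 also ends up as the int 10, because the visitor has already discarded the original spelling.

```python
def narrow_numbers(value):
    """Recursively turn whole-valued floats such as 10.0 into ints."""
    if isinstance(value, list):
        return [narrow_numbers(child) for child in value]
    # nan and inf are left alone because is_integer() is False for them;
    # booleans and strings are not floats, so they pass through unchanged.
    if isinstance(value, float) and value.is_integer():
        return int(value)
    return value


# One row in the shape of the output above.
row = [0.0, 10.0, float("nan"), True, 1.0, [[10.0, 10.4], ["c", 10.0, "eee"]]]
result = narrow_numbers(row)
print(result)
```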

Regarding "python - Advanced Python regex: how to evaluate and extract nested lists and numbers from a multiline string?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53531519/
