Python: Efficiently extracting nested elements from XML

Reposted. Author: 行者123. Updated: 2023-12-01 23:43:56

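I am trying to parse a large number of XML files, each containing many nested elements, to collect specific information for later use. Because there are so many files, I want to do this as efficiently as possible to reduce processing time. I can extract the information I need with XPath as shown below, but it seems inefficient, especially having to run a second for loop that issues another XPath search to pull out the result values. I have read the post "Efficient way to iterate through xml elements" and the article "High-performance XML parsing in Python with lxml", but I don't understand how to apply them to my use case. Is there a more efficient way to get the desired output below? Can I collect the information I need with a single XPath query?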

Desired parsed format:

Id              Object    Type              Result
Packages        total     totalPackages     1200
DeliveryMethod  priority  packagesSent      100
DeliveryMethod  express   packagesSent      200
DeliveryMethod  ground    packagesSent      300
DeliveryMethod  priority  packagesReceived  100
DeliveryMethod  express   packagesReceived  200
DeliveryMethod  ground    packagesReceived  300

Sample XML:

<?xml version="1.0" encoding="utf-8"?>
<Data>
  <Location localDn="Chicago"/>
  <Info Id="Packages">
    <job jobId="1"/>
    <Type pos="1">totalPackages</Type>
    <Value Object="total">
      <result pos="1">1200</result>
    </Value>
  </Info>
  <Info Id="DeliveryMethod">
    <job jobId="1"/>
    <Type pos="1">packagesSent</Type>
    <Type pos="2">packagesReceived</Type>
    <Value Object="priority">
      <result pos="1">100</result>
      <result pos="2">100</result>
    </Value>
    <Value Object="express">
      <result pos="1">200</result>
      <result pos="2">200</result>
    </Value>
    <Value Object="ground">
      <result pos="1">300</result>
      <result pos="2">300</result>
    </Value>
  </Info>
</Data>

My approach:

from lxml import etree

xml_file = open('stack_sample.xml')
tree = etree.parse(xml_file)
root = tree.getroot()

for elem in tree.xpath('//*'):
    if elem.tag == 'Type':
        for value in tree.xpath(f'//*/Info[@Id="{elem.getparent().attrib["Id"]}"]/Value/result[@pos="{elem.attrib["pos"]}"]'):
            print(elem.getparent().attrib['Id'], value.getparent().attrib['Object'], elem.text, value.text)

Current output:

Packages total totalPackages 1200
DeliveryMethod priority packagesSent 100
DeliveryMethod express packagesSent 200
DeliveryMethod ground packagesSent 300
DeliveryMethod priority packagesReceived 100
DeliveryMethod express packagesReceived 200
DeliveryMethod ground packagesReceived 300

Is it possible to get all of the information in a single iteration over tree.xpath('//*')?

Best answer

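The first optimization is to stop iterating over every tag with tree.xpath('//*') and filtering with an if statement, as you do now. Select only the elements you need with tree.xpath('//Type') instead.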

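The next thing to optimize is the iteration over the values. Instead of searching all Value elements in the document again for every Type (tree.xpath('//Value')), you can fetch just the Value elements that are siblings of the current Type with elem.xpath('./following-sibling::Value').

from lxml import etree

# etree.parse accepts a file name directly, so no open() call is needed.
tree = etree.parse('stack_sample.xml')

# Select only the Type elements instead of walking every tag.
for elem in tree.xpath('//Type'):
    _id = elem.getparent().attrib["Id"]
    _type = elem.text
    _position = elem.attrib["pos"]
    # Value elements that follow this Type under the same Info parent.
    values = elem.xpath('./following-sibling::Value')
    for value in values:
        _object = value.attrib['Object']
        # The result whose pos matches the pos of the current Type.
        _result = value.xpath(f'./result[@pos={_position}]/text()')[0]
        print(_id, _type, _object, _result)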

This prints:

Packages totalPackages total 1200
DeliveryMethod packagesSent priority 100
DeliveryMethod packagesSent express 200
DeliveryMethod packagesSent ground 300
DeliveryMethod packagesReceived priority 100
DeliveryMethod packagesReceived express 200
DeliveryMethod packagesReceived ground 300
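
On the question of gathering everything in one pass: the following is a minimal sketch, assuming files laid out like the sample above, that visits each Info once and resolves Type names through a dictionary keyed by pos instead of running an XPath query per Type. This is a different technique from the following-sibling approach above, not a restatement of it. The names extract_rows, type_by_pos, and SAMPLE are hypothetical, and the standard-library fallback import is only there so the snippet runs where lxml is not installed.

```python
try:
    from lxml import etree  # the library used in this thread
except ImportError:
    # Assumption: the stdlib module supports the few calls used below.
    import xml.etree.ElementTree as etree

# Inline copy of the sample document (XML declaration omitted).
SAMPLE = """<Data>
  <Location localDn="Chicago"/>
  <Info Id="Packages">
    <job jobId="1"/>
    <Type pos="1">totalPackages</Type>
    <Value Object="total">
      <result pos="1">1200</result>
    </Value>
  </Info>
  <Info Id="DeliveryMethod">
    <job jobId="1"/>
    <Type pos="1">packagesSent</Type>
    <Type pos="2">packagesReceived</Type>
    <Value Object="priority">
      <result pos="1">100</result>
      <result pos="2">100</result>
    </Value>
    <Value Object="express">
      <result pos="1">200</result>
      <result pos="2">200</result>
    </Value>
    <Value Object="ground">
      <result pos="1">300</result>
      <result pos="2">300</result>
    </Value>
  </Info>
</Data>"""


def extract_rows(root):
    """Return (Id, Type, Object, Result) tuples, visiting each Info once."""
    rows = []
    for info in root.iter("Info"):
        info_id = info.get("Id")
        # Map each pos value to its Type name, e.g. {"1": "packagesSent"}.
        type_by_pos = {t.get("pos"): t.text for t in info.findall("Type")}
        for value in info.findall("Value"):
            obj = value.get("Object")
            for result in value.findall("result"):
                # Look the Type up by the pos attribute of the result.
                rows.append((info_id, type_by_pos[result.get("pos")], obj, result.text))
    return rows


rows = extract_rows(etree.fromstring(SAMPLE))
for row in rows:
    print(*row)
```

The rows come out grouped by Value rather than by Type, so sort them afterwards if the order matters. Whether this is faster than the XPath version on real files is something to measure rather than assume.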

Edit

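This is a solution for a specific case: we are sure that the number of result elements in each Value tag equals the number of Type tags that are siblings of that Value. The solution also assumes that the Type and result elements are ordered by the same pos attribute.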

Keep in mind that this is a very specific solution, not a general one.

from lxml import etree

tree = etree.parse('stack_sample.xml')

for elem in tree.xpath('//Type'):
    _id = elem.getparent().attrib["Id"]
    _type = elem.text
    # 1-based position of this Type among the Type children of its Info.
    _index = len(elem.xpath('./preceding-sibling::Type')) + 1
    _objects = elem.xpath('./following-sibling::Value/@Object')
    # Take the result at that same position inside every Value, so each
    # Object is paired with exactly one result.
    _results = elem.xpath(f'./following-sibling::Value/result[{_index}]/text()')
    for _object, _result in zip(_objects, _results):
        print(_id, _type, _object, _result)

Output:

Packages totalPackages total 1200
DeliveryMethod packagesSent priority 100
DeliveryMethod packagesSent express 200
DeliveryMethod packagesSent ground 300
DeliveryMethod packagesReceived priority 100
DeliveryMethod packagesReceived express 200
DeliveryMethod packagesReceived ground 300
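
Since the question mentions many files and the lxml high-performance parsing article, here is a hedged sketch of streaming a document with iterparse and releasing each Info subtree once it has been processed. It assumes the same element layout as the sample. The names iter_rows and SAMPLE are hypothetical, the in-memory BytesIO stands in for a real file, and the stdlib fallback import is only there so the snippet runs without lxml.

```python
import io

try:
    from lxml import etree  # the library used in this thread
except ImportError:
    # Assumption: the stdlib module supports the few calls used below.
    import xml.etree.ElementTree as etree

# Shortened stand-in for one input file.
SAMPLE = """<Data>
  <Info Id="DeliveryMethod">
    <Type pos="1">packagesSent</Type>
    <Type pos="2">packagesReceived</Type>
    <Value Object="priority">
      <result pos="1">100</result>
      <result pos="2">100</result>
    </Value>
    <Value Object="express">
      <result pos="1">200</result>
      <result pos="2">200</result>
    </Value>
  </Info>
</Data>"""


def iter_rows(source):
    """Yield (Id, Type, Object, Result) tuples from a file name or binary file object."""
    # An "end" event fires once an element and all of its children are parsed.
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag != "Info":
            continue
        info_id = elem.get("Id")
        type_by_pos = {t.get("pos"): t.text for t in elem.findall("Type")}
        for value in elem.findall("Value"):
            for result in value.findall("result"):
                yield (info_id, type_by_pos[result.get("pos")], value.get("Object"), result.text)
        # Drop the children of the processed Info so memory use stays bounded.
        elem.clear()


rows = list(iter_rows(io.BytesIO(SAMPLE.encode("utf-8"))))
for row in rows:
    print(*row)
```

For small files the plain etree.parse approach is likely simpler and fast enough; streaming mainly helps when individual files are large, so it is worth profiling both on representative data.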

Regarding efficiently extracting nested elements from XML in Python, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/64555247/
