python - Tabulating an XML file in Python with the parent node numbers added

Reposted. Author: 行者123. Updated: 2023-12-01 07:47:26

I have the following code, which tries to parse an XML file and convert it to tabular form.

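import xml.etree.ElementTree as ET
tree = ET.parse('smp.xml')
root = tree.getroot()

# Print the attribute dict of every <text> element
for text in root.iter('text'):
    print(text.attrib)

# Print the text content of every <text> element
for text in root.iter('text'):
    print(text.text)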

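Below is the output I have so far. It is far from what I want: I am new to Python and do not know how to arrange this output as a table, with additional columns on the left giving the parent page, row, and column of each text element alongside its attributes: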

>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('smp.xml')
>>> root = tree.getroot()
>>>
>>> for text in root.iter('text'):
...     print(text.attrib)
...
{'width': '71.04', 'fontName': 'Arial', 'fontSize': '12.0', 'y': '83.42', 'x': '121.10', 'height': '12.00'}
{'width': '101.07', 'fontName': 'Arial', 'fontSize': '12.0', 'y': '124.82', 'x': '121.10', 'height': '12.00'}
{'width': '140.31', 'fontName': 'Arial', 'fontSize': '12.0', 'y': '207.65', 'x': '121.10', 'height': '12.00'}
{'width': '24.36', 'fontName': 'Arial', 'fontSize': '12.0', 'y': '69.62', 'x': '85.10', 'height': '12.00'}
{'width': '95.42', 'fontName': 'Arial', 'fontStyle': 'Bold', 'fontSize': '12.0', 'y': '239.45', 'x': '276.29', 'height': '12.00'}
{'width': '229.57', 'fontName': 'Arial', 'fontSize': '12.0', 'y': '266.81', 'x': '121.10', 'height': '12.00'}
{'width': '155.71', 'fontName': 'Arial', 'fontSize': '12.0', 'y': '266.81', 'x': '353.94', 'height': '12.00'}
{'width': '165.10', 'fontName': 'Arial', 'fontSize': '12.0', 'y': '294.41', 'x': '85.10', 'height': '12.00'}
{'width': '14.39', 'fontName': 'Arial', 'fontSize': '12.0', 'y': '294.41', 'x': '253.43', 'height': '12.00'}
{'width': '255.64', 'fontName': 'Arial', 'fontSize': '12.0', 'y': '294.41', 'x': '271.04', 'height': '12.00'}
{'width': '432.97', 'fontName': 'Arial', 'fontSize': '12.0', 'y': '501.43', 'x': '85.10', 'height': '12.00'}
{'width': '363.44', 'fontName': 'Arial', 'fontSize': '12.0', 'y': '69.62', 'x': '85.10', 'height': '12.00'}
{'width': '382.36', 'fontName': 'Arial', 'fontSize': '12.0', 'y': '83.42', 'x': '85.10', 'height': '12.00'}

>>> for text in root.iter('text'):
...     print(text.text)
...
achene
capsule
caryopsis
cypsela
fibrous drupe
follicle
legume
loment
nut
samara
schizocarp
silicle
utricle

This is my expected output:

╔══════╦═══════╦═════╦════════╦════════════╦══════════╦══════════╦════════╦════════╦════════╦════════╦═══════════╗
║ page ║ index ║ row ║ column ║ text ║ fontName ║ fontSize ║ x ║ y ║ width ║ height ║ fontStyle ║
╠══════╬═══════╬═════╬════════╬════════════╬══════════╬══════════╬════════╬════════╬════════╬════════╬═══════════╣
║ 0 ║ 0 ║ ║ ║ ║ ║ ║ ║ ║ ║ ║ ║
║ 1 ║ 1 ║ 0 ║ 0 ║ achene ║ Arial ║ 12 ║ 121.1 ║ 83.42 ║ 71.04 ║ 12 ║ ║
║ 1 ║ 1 ║ 1 ║ 0 ║ capsule ║ Arial ║ 12 ║ 121.1 ║ 124.82 ║ 101.07 ║ 12 ║ ║
║ 1 ║ 1 ║ 2 ║ 0 ║ caryopsis ║ Arial ║ 12 ║ 121.1 ║ 207.65 ║ 140.31 ║ 12 ║ ║
║ 2 ║ 2 ║ 0 ║ 0 ║ cypsela ║ Arial ║ 12 ║ 85.1 ║ 69.62 ║ 24.36 ║ 12 ║ ║
║ 3 ║ 3 ║ ║ ║ ║ ║ ║ ║ ║ ║ ║ ║
║ 4 ║ 4 ║ 0 ║ 0 ║ fibrous ║ Arial ║ 12 ║ 276.29 ║ 239.45 ║ 95.42 ║ 12 ║ Bold ║
║ 4 ║ 4 ║ 1 ║ 1 ║ follicle ║ Arial ║ 12 ║ 121.1 ║ 266.81 ║ 229.57 ║ 12 ║ ║
║ 4 ║ 4 ║ 1 ║ 1 ║ legume ║ Arial ║ 12 ║ 353.94 ║ 266.81 ║ 155.71 ║ 12 ║ ║
║ 4 ║ 4 ║ 2 ║ 2 ║ loment ║ Arial ║ 12 ║ 85.1 ║ 294.41 ║ 165.1 ║ 12 ║ ║
║ 4 ║ 4 ║ 2 ║ 2 ║ nut ║ Arial ║ 12 ║ 253.43 ║ 294.41 ║ 14.39 ║ 12 ║ ║
║ 4 ║ 4 ║ 2 ║ 2 ║ samara ║ Arial ║ 12 ║ 271.04 ║ 294.41 ║ 255.64 ║ 12 ║ ║
║ 4 ║ 4 ║ 3 ║ 0 ║ schizocarp ║ Arial ║ 12 ║ 85.1 ║ 501.43 ║ 432.97 ║ 12 ║ ║
║ 5 ║ 5 ║ 0 ║ 0 ║ silicle ║ Arial ║ 12 ║ 85.1 ║ 69.62 ║ 363.44 ║ 12 ║ ║
║ 5 ║ 5 ║ 1 ║ 1 ║ utricle ║ Arial ║ 12 ║ 85.1 ║ 83.42 ║ 382.36 ║ 12 ║ ║
║ 6 ║ 6 ║ ║ ║ ║ ║ ║ ║ ║ ║ ║ ║
╚══════╩═══════╩═════╩════════╩════════════╩══════════╩══════════╩════════╩════════╩════════╩════════╩═══════════╝

This is the XML file:

<document>
<page index="0"/>
<page index="1">
<row><column><text fontName="Arial" fontSize="12.0" x="121.10" y="83.42" width="71.04" height="12.00">achene</text></column></row>
<row><column><text fontName="Arial" fontSize="12.0" x="121.10" y="124.82" width="101.07" height="12.00">capsule</text></column></row>
<row><column><text fontName="Arial" fontSize="12.0" x="121.10" y="207.65" width="140.31" height="12.00">caryopsis</text></column></row>
</page>
<page index="2">
<row><column><text fontName="Arial" fontSize="12.0" x="85.10" y="69.62" width="24.36" height="12.00">cypsela</text></column></row>
</page>
<page index="3"/>
<page index="4">
<row><column><text fontName="Arial" fontSize="12.0" fontStyle="Bold" x="276.29" y="239.45" width="95.42" height="12.00">fibrous drupe</text></column></row>
<row><column><text fontName="Arial" fontSize="12.0" x="121.10" y="266.81" width="229.57" height="12.00">follicle</text></column>
<column><text fontName="Arial" fontSize="12.0" x="353.94" y="266.81" width="155.71" height="12.00">legume</text></column></row>
<row><column><text fontName="Arial" fontSize="12.0" x="85.10" y="294.41" width="165.10" height="12.00">loment – a type of indehiscent legume</text></column>
<column><text fontName="Arial" fontSize="12.0" x="253.43" y="294.41" width="14.39" height="12.00">nut</text></column>
<column><text fontName="Arial" fontSize="12.0" x="271.04" y="294.41" width="255.64" height="12.00">samara</text></column></row>
<row><column><text fontName="Arial" fontSize="12.0" x="85.10" y="501.43" width="432.97" height="12.00">schizocarp</text></column></row>
</page>
<page index="5">
<row><column><text fontName="Arial" fontSize="12.0" x="85.10" y="69.62" width="363.44" height="12.00">silicle</text></column></row>
<row><column><text fontName="Arial" fontSize="12.0" x="85.10" y="83.42" width="382.36" height="12.00">utricle</text></column></row>
</page>
<page index="6"/>
</document>

Thanks in advance for your help.

Best answer

This should get you close enough:

import pandas as pd
import xml.etree.ElementTree as ET

# xml_string is assumed to hold the XML document shown in the question
etree = ET.fromstring(xml_string)

dfcols = ['index', 'text', 'fontName', 'fontSize', 'x', 'y', 'width', 'height', 'fontStyle']
rows = []

# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
for page in etree.iter('page'):
    for text in page.iter('text'):
        # Page index, element text, then each attribute named in dfcols
        rows.append([page.get('index'), text.text] + [text.get(c) for c in dfcols[2:]])

df = pd.DataFrame(rows, columns=dfcols)
df.head()

Output:

   index           text  fontName  fontSize       x       y   width  height  fontStyle
0      1         achene     Arial      12.0  121.10   83.42   71.04   12.00       None
1      1        capsule     Arial      12.0  121.10  124.82  101.07   12.00       None
2      1      caryopsis     Arial      12.0  121.10  207.65  140.31   12.00       None
3      2        cypsela     Arial      12.0   85.10   69.62   24.36   12.00       None
4      4  fibrous drupe     Arial      12.0  276.29  239.45   95.42   12.00       Bold
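
The table above carries only the page index, while the question also asks for row and column numbers and for the empty pages to appear. The following is a minimal sketch of one way to get there, not the answerer's method. It assumes that rows and columns can simply be numbered by their position within the parent element (the column numbering in the question's expected table follows a different scheme that is not fully specified). The helper name `xml_to_rows`, the `ATTRS` list, and the shortened `SAMPLE_XML` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the same shape as the question's smp.xml (illustrative only)
SAMPLE_XML = """
<document>
  <page index="0"/>
  <page index="1">
    <row><column><text fontName="Arial" x="121.10" y="83.42">achene</text></column></row>
    <row>
      <column><text fontName="Arial" x="121.10" y="266.81">follicle</text></column>
      <column><text fontName="Arial" fontStyle="Bold" x="353.94" y="266.81">legume</text></column>
    </row>
  </page>
</document>
"""

ATTRS = ['fontName', 'fontSize', 'x', 'y', 'width', 'height', 'fontStyle']


def xml_to_rows(root):
    """Hypothetical helper: flatten page/row/column/text into a list of dicts."""
    records = []
    for page in root.iter('page'):
        page_index = page.get('index')
        found_text = False
        # enumerate() supplies the position of each row and column within its parent
        for row_no, row in enumerate(page.findall('row')):
            for col_no, column in enumerate(row.findall('column')):
                for text in column.findall('text'):
                    record = {'index': page_index, 'row': row_no,
                              'column': col_no, 'text': text.text}
                    # Attributes missing on the element (e.g. fontStyle) become None
                    record.update({name: text.get(name) for name in ATTRS})
                    records.append(record)
                    found_text = True
        if not found_text:
            # Keep pages without any text as a record holding only the page index
            records.append({'index': page_index})
    return records


records = xml_to_rows(ET.fromstring(SAMPLE_XML))
for record in records:
    print(record.get('index'), record.get('row'), record.get('column'), record.get('text'))
```

Passing the resulting list to `pd.DataFrame(records)` should give one table row per record, with the keys missing from the empty-page records filled in as NaN.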

For "python - Tabulating an XML file in Python with the parent node numbers added", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56402748/
