
python - Unable to parse XML and import the data into a Pandas DataFrame

Reposted. Author: 太空宇宙. Updated: 2023-11-04 04:58:43

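I am trying to import data from an XML file that contains breath-by-breath data from an exercise test. The XML is structured as follows (simplified to show the general structure):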

<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:html="http://www.w3.org/TR/REC-html40">
<Worksheet ss:Name="MetasoftStudio">
<Table ss:ExpandedColumnCount="21" ss:ExpandedRowCount="458" x:FullColumns="1" x:FullRows="1" ss:StyleID="s62" ss:DefaultColumnWidth="53">
<Column ss:StyleID="s62" ss:AutoFitWidth="0" ss:Width="137"/>
<Column ss:StyleID="s62" ss:AutoFitWidth="0" ss:Width="97"/>
<Column ss:StyleID="s62" ss:AutoFitWidth="0" ss:Width="137"/>
<Row ss:AutoFitHeight="0" ss:Height="26">
<Cell ss:StyleID="Default"><Data ss:Type="String">t</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">Phase</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">Marker</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">V'O2</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">V'O2/kg</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">V'O2/HR</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">HR</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">WR</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">V'E/V'O2</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">V'E/V'CO2</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">RER</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">V'E</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">BF</Data></Cell>
</Row>
<Row ss:Height="15">
<Cell ss:StyleID="Default"><Data ss:Type="String">h:mm:ss</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String"></Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String"></Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">L/min</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">ml/min/kg</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">ml</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">/min</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">W</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String"></Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String"></Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String"></Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">L/min</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">/min</Data></Cell>
</Row>
<Row ss:Height="15">
<Cell ss:StyleID="Default"><Data ss:Type="String">0:00:06</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">Rest</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String"></Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="Number">0.27972413565454501</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="Number">4.3706896196022598</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="Number">4.5856415681072953</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="Number">61</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="Number">0</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="Number">27.002532271037801</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="Number">26.4113108545688</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="Number">1.0223851598932201</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="Number">10.155340000000001</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="Number">18.07</Data></Cell>
</Row>
</Table>
</Worksheet>
</Workbook>

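I use lxml to parse and walk the XML file, extract the "Data" inside each "Cell", append it to a list, and then append that list to a parent list (giving me a nested list with one inner list per row), using this code: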

from lxml import etree, objectify
import pandas as pd

with open('Python/cortex.xml') as infile:
    xml_file = infile.read()

root = objectify.fromstring(xml_file)

header = []
data = []

for row in root.Worksheet.Table.getchildren():
    temp_row = []
    if not row.tag == '{urn:schemas-microsoft-com:office:spreadsheet}Column':
        for cell in row.getchildren():
            temp_row.append(cell.Data)
        data.append(temp_row)
header = data.pop(0)  # remove the first 'row' and store in header list
del data[0]  # remove 2nd line of superfluous data

The first row holds the headers, so I pop it into its own list; the second row holds the units for each variable, so I delete it. So far everything works (or so it seems)...

Now I need to get this into a pd DataFrame to start working with it. If I run df = pd.DataFrame(data, columns=header) and then print(df), I get: ValueError: Buffer has wrong number of dimensions (expected 1, got 32)

OK, I am not sure what is going on there... If I create the df without assigning the headers and print it, I get:

              0           1       2                        3   \
0 [[[0:00:06]]] [[[Rest]]] [[[]]] [[[0.279724135654545]]]
1 [[[0:00:09]]] [[[Rest]]] [[[]]] [[[0.465136232899829]]]
2 [[[0:00:13]]] [[[Rest]]] [[[]]] [[[0.357975433456662]]]
3 [[[0:00:19]]] [[[Rest]]] [[[]]] [[[0.543332419057909]]]
4 [[[0:00:24]]] [[[Rest]]] [[[]]] [[[0.374604578743889]]]

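This doesn't look right! Where are all these lists within lists coming from? If I loop over the nested list data and print it, it prints perfectly, but as soon as I try to convert it to a df, things go wrong.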

Can anyone tell me what is happening and how to get the data into a pd df? If there is a better way than my approach, I am happy to give it a try.

Best answer

You can build a list of lists and then create the DataFrame through the constructor. For the parsing, use this solution:

from lxml import etree
import pandas as pd  # needed for the DataFrame constructor used further down

with open('test.xml', 'r') as f:
    doc = etree.parse(f)

# Prefix-to-URI map used by the XPath queries below
namespaces = {'o': 'urn:schemas-microsoft-com:office:office',
              'x': 'urn:schemas-microsoft-com:office:excel',
              'ss': 'urn:schemas-microsoft-com:office:spreadsheet'}

L = []
ws = doc.xpath('/ss:Workbook/ss:Worksheet', namespaces=namespaces)
if len(ws) > 0:
    tables = ws[0].xpath('./ss:Table', namespaces=namespaces)
    if len(tables) > 0:
        rows = tables[0].xpath('./ss:Row', namespaces=namespaces)
        for row in rows:
            tmp = []
            cells = row.xpath('./ss:Cell/ss:Data', namespaces=namespaces)
            for cell in cells:
                # cell.text is a plain string (or None for an empty cell)
                tmp.append(cell.text)
            L.append(tmp)
print(L)

[['t', 'Phase', 'Marker', "V'O2", "V'O2/kg", "V'O2/HR", 'HR', 'WR', 
"V'E/V'O2", "V'E/V'CO2", 'RER', "V'E", 'BF'],
['h:mm:ss', None, None, 'L/min', 'ml/min/kg', 'ml',
'/min', 'W', None, None, None, 'L/min', '/min'],
['0:00:06', 'Rest', None, '0.27972413565454501', '4.3706896196022598',
'4.5856415681072953', '61', '0', '27.002532271037801', '26.4113108545688',
'1.0223851598932201', '10.155340000000001', '18.07']]
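The key difference from the code in the question is that plain strings (cell.text) are collected instead of element objects (cell.Data). An lxml.objectify element is not a str, and it behaves like a sequence, which is most likely why pandas displays it as nested lists. As a minimal sketch, assuming a trimmed-down stand-in for the export (the xml_bytes sample and the SS constant below are illustrative, not from the original post), the objectify approach can be kept by taking .text from each Data element:

```python
from lxml import objectify

# Namespace URI of the SpreadsheetML elements (same URI as in the question).
SS = "urn:schemas-microsoft-com:office:spreadsheet"

# Illustrative, shortened stand-in for the real export file (assumption).
xml_bytes = b"""<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
 <Worksheet ss:Name="MetasoftStudio">
  <Table>
   <Column ss:Width="137"/>
   <Row>
    <Cell><Data ss:Type="String">t</Data></Cell>
    <Cell><Data ss:Type="String">HR</Data></Cell>
   </Row>
   <Row>
    <Cell><Data ss:Type="String">0:00:06</Data></Cell>
    <Cell><Data ss:Type="Number">61</Data></Cell>
   </Row>
  </Table>
 </Worksheet>
</Workbook>"""

root = objectify.fromstring(xml_bytes)

rows = []
# Iterate only over <Row> children, which skips the <Column> elements.
for row in root.Worksheet.Table.iterchildren(tag="{%s}Row" % SS):
    # cell.Data is an lxml element object; .text is its plain string content.
    rows.append([cell.Data.text
                 for cell in row.iterchildren(tag="{%s}Cell" % SS)])

# The first Data element is an lxml object, not a Python str.
first = root.Worksheet.Table.Row.Cell.Data
print(type(first).__name__, isinstance(first, str))
print(rows)
```

With plain strings in rows, the nested list can be passed to the DataFrame constructor in the same way as L.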

df = pd.DataFrame(L[2:], columns=L[0])
print (df)
         t  Phase  Marker                 V'O2             V'O2/kg  \
0  0:00:06   Rest    None  0.27972413565454501  4.3706896196022598

              V'O2/HR  HR  WR            V'E/V'O2         V'E/V'CO2  \
0  4.5856415681072953  61   0  27.002532271037801  26.4113108545688

                  RER                 V'E     BF
0  1.0223851598932201  10.155340000000001  18.07
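Note that every value in this DataFrame is still a string, because cell.text always returns text. The following is a minimal sketch of one way to convert the columns afterwards. It assumes that t, Phase and Marker are the only text columns (the text_cols list is an illustrative choice, not part of the original answer) and uses a shortened copy of the rows shown above:

```python
import pandas as pd

# Rows in the same shape as L above, shortened to a few columns.
L = [
    ["t", "Phase", "Marker", "V'O2", "HR"],
    ["h:mm:ss", None, None, "L/min", "/min"],
    ["0:00:06", "Rest", None, "0.27972413565454501", "61"],
]

# Row 0 holds the headers, row 1 the units, the data starts at row 2.
df = pd.DataFrame(L[2:], columns=L[0])

# Columns that should stay as text; all others are treated as numeric (assumption).
text_cols = ["t", "Phase", "Marker"]
for col in df.columns:
    if col not in text_cols:
        # errors="coerce" turns unparseable values into NaN instead of raising.
        df[col] = pd.to_numeric(df[col], errors="coerce")

# Parse the h:mm:ss timestamps into timedeltas.
df["t"] = pd.to_timedelta(df["t"])

print(df.dtypes)
```

If the units in the second row are needed later, they could be kept in a separate dict built from L[0] and L[1] rather than discarded.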

Regarding python - Unable to parse XML and import the data into a Pandas DataFrame, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46387091/
