
python - Tidying data from a multi-level Excel file with Pandas

Reposted. Author: 太空狗. Last updated: 2023-10-30 01:12:46

I want to produce tidy data from an Excel file that has three levels of "merged" headers (the screenshot from the original post is not reproduced here).

Pandas reads the file with its multi-level header just fine:

import pandas

# df = pandas.read_excel('test.xlsx', header=[0, 1, 2])

For reproducibility, you can copy and paste this:

df = pandas.DataFrame({('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'a'): {1: 'aX', 2: 'aY'}, ('Unnamed: 1_level_0', 'Unnamed: 1_level_1', 'b'): {1: 'bX', 2: 'bY'}, ('Unnamed: 2_level_0', 'Unnamed: 2_level_1', 'c'): {1: 'cX', 2: 'cY'}, ('level1_1', 'level2_1', 'level3_1'): {1: 1, 2: 10}, ('level1_1', 'level2_1', 'level3_2'): {1: 2, 2: 20}, ('level1_1', 'level2_2', 'level3_1'): {1: 3, 2: 30}, ('level1_1', 'level2_2', 'level3_2'): {1: 4, 2: 40}, ('level1_2', 'level2_1', 'level3_1'): {1: 5, 2: 50}, ('level1_2', 'level2_1', 'level3_2'): {1: 6, 2: 60}, ('level1_2', 'level2_2', 'level3_1'): {1: 7, 2: 70}, ('level1_2', 'level2_2', 'level3_2'): {1: 8, 2: 80}})

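I would like to normalize this so that the horizontal headers are moved into rows (one variable column per header level), while a, b, and c are kept as columns. The image of the desired output is not reproduced here.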

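Without the multi-level header, I would call pandas.melt(df, id_vars=['a', 'b', 'c']) to get what I want. pandas.melt(df) gives me the three variable columns I want, but obviously does not keep a, b, and c as columns.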
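For reference, the flat-header case mentioned above can be sketched as follows. The frame flat_df is hypothetical (it is not the asker's data); it only keeps the innermost header level so that id_vars can be plain strings.

```python
import pandas

# Hypothetical single-level version of the data: same id columns a, b, c,
# but only one header row for the value columns.
flat_df = pandas.DataFrame({
    'a': ['aX', 'aY'],
    'b': ['bX', 'bY'],
    'c': ['cX', 'cY'],
    'level3_1': [1, 10],
    'level3_2': [2, 20],
})

# id_vars stay as columns; every remaining column is unpivoted into
# one (variable, value) pair per row.
tidy_flat = pandas.melt(flat_df, id_vars=['a', 'b', 'c'])
print(tidy_flat)
```

With three header levels the id columns are no longer addressable by a single string, which is why this call does not carry over directly.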

Best answer

It should be as simple as:

wide_df = pandas.read_excel(xlfile, sheetname, header=[0, 1, 2], index_col=[0, 1, 2, 3])

long_df = wide_df.stack().stack().stack()

Here is an example with a mock CSV file (note that line 4 labels the index levels and the first column labels the header levels):

from io import StringIO
from textwrap import dedent

import pandas

mockcsv = StringIO(dedent("""\
num,,,this1,this1,this1,this1,that1,that1,that1,that1
let,,,thisA,thisA,thatA,thatA,thisB,thisB,thatB,thatB
animal,,,cat,dog,bird,lizard,cat,dog,bird,lizard
a,b,c,,,,,,,,
a1,b1,c1,x1,x2,x3,x4,x5,x6,x7,x8
a1,b1,c2,y1,y2,y3,y4,y5,y6,y7,y8
a1,b2,c1,z1,z2,z3,z4,z5,6z,zy,z8
"""))


wide_df = pandas.read_csv(mockcsv, index_col=[0, 1, 2], header=[0, 1, 2])
long_df = wide_df.stack().stack().stack()

So wide_df looks like this:

num      this1                  that1
let      thisA     thatA        thisB     thatB
animal     cat dog  bird lizard   cat dog  bird lizard
a  b  c
a1 b1 c1    x1  x2    x3     x4    x5  x6    x7     x8
      c2    y1  y2    y3     y4    y5  y6    y7     y8
   b2 c1    z1  z2    z3     z4    z5  6z    zy     z8

And long_df looks like this:

a   b   c   animal  let    num
a1  b1  c1  bird    thatA  this1    x3
                    thatB  that1    x7
            cat     thisA  this1    x1
                    thisB  that1    x5
            dog     thisA  this1    x2
                    thisB  that1    x6
            lizard  thatA  this1    x4
                    thatB  that1    x8
        c2  bird    thatA  this1    y3
                    thatB  that1    y7
            cat     thisA  this1    y1
                    thisB  that1    y5
            dog     thisA  this1    y2
                    thisB  that1    y6
            lizard  thatA  this1    y4
                    thatB  that1    y8
    b2  c1  bird    thatA  this1    z3
                    thatB  that1    zy
            cat     thisA  this1    z1
                    thisB  that1    z5
            dog     thisA  this1    z2
                    thisB  that1    6z
            lizard  thatA  this1    z4
                    thatB  that1    z8
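The same long layout can also be reached with melt, which ties back to the question's starting point. The sketch below rebuilds wide_df from the mock CSV and assumes a pandas version whose melt accepts ignore_index (1.1 or later). The name tidy_df is illustrative, and the value column simply uses melt's default name; neither comes from the original answer.

```python
from io import StringIO
from textwrap import dedent

import pandas

mockcsv = StringIO(dedent("""\
num,,,this1,this1,this1,this1,that1,that1,that1,that1
let,,,thisA,thisA,thatA,thatA,thisB,thisB,thatB,thatB
animal,,,cat,dog,bird,lizard,cat,dog,bird,lizard
a,b,c,,,,,,,,
a1,b1,c1,x1,x2,x3,x4,x5,x6,x7,x8
a1,b1,c2,y1,y2,y3,y4,y5,y6,y7,y8
a1,b2,c1,z1,z2,z3,z4,z5,6z,zy,z8
"""))

wide_df = pandas.read_csv(mockcsv, index_col=[0, 1, 2], header=[0, 1, 2])

# melt unpivots all three header levels in one step. ignore_index=False keeps
# the (a, b, c) row index, and reset_index turns it back into ordinary columns.
tidy_df = wide_df.melt(ignore_index=False).reset_index()
print(tidy_df.head())
```

Because melt only emits the header combinations that actually exist as columns, the result does not depend on how a given pandas version's stack treats combinations that are absent from the header. The rows come out column by column rather than sorted by index.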

With the literal data shown in the OP, you can get this without modifying anything by doing:

index_names = ['a', 'b', 'c']
col_names = ['Level1', 'Level2', 'Level3']
df = (
    pandas.read_excel('Book1.xlsx', header=[0, 1, 2], index_col=[0, 1, 2, 3])
    .reset_index(level=0, drop=True)
    .rename_axis(index_names, axis='index')
    .rename_axis(col_names, axis='columns')
    .stack()
    .stack()
    .stack()
    .to_frame()
)
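Book1.xlsx is not available here, so the same idea can be tried on the reproducible frame from the question instead. This is a sketch under assumptions: it takes the first three columns to be the id columns and rebuilds the header from the remaining column tuples. The names n_ids, id_index, wide_df and the value column are illustrative and do not come from the answer.

```python
import pandas

# The reproducible frame from the question: the id columns carry
# 'Unnamed: ...' placeholders in the two upper header levels.
df = pandas.DataFrame({
    ('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'a'): {1: 'aX', 2: 'aY'},
    ('Unnamed: 1_level_0', 'Unnamed: 1_level_1', 'b'): {1: 'bX', 2: 'bY'},
    ('Unnamed: 2_level_0', 'Unnamed: 2_level_1', 'c'): {1: 'cX', 2: 'cY'},
    ('level1_1', 'level2_1', 'level3_1'): {1: 1, 2: 10},
    ('level1_1', 'level2_1', 'level3_2'): {1: 2, 2: 20},
    ('level1_1', 'level2_2', 'level3_1'): {1: 3, 2: 30},
    ('level1_1', 'level2_2', 'level3_2'): {1: 4, 2: 40},
    ('level1_2', 'level2_1', 'level3_1'): {1: 5, 2: 50},
    ('level1_2', 'level2_1', 'level3_2'): {1: 6, 2: 60},
    ('level1_2', 'level2_2', 'level3_1'): {1: 7, 2: 70},
    ('level1_2', 'level2_2', 'level3_2'): {1: 8, 2: 80},
})

index_names = ['a', 'b', 'c']
col_names = ['Level1', 'Level2', 'Level3']
n_ids = len(index_names)  # assumption: the id columns come first

# Move the id columns into a named row index.
id_index = pandas.MultiIndex.from_arrays(
    [df.iloc[:, i] for i in range(n_ids)], names=index_names
)

# Rebuild the value block with a clean, named three-level header.
wide_df = pandas.DataFrame(
    df.iloc[:, n_ids:].to_numpy(),
    index=id_index,
    columns=pandas.MultiIndex.from_tuples(list(df.columns[n_ids:]), names=col_names),
)

# Each stack() moves the innermost remaining header level into the row index.
long_df = wide_df.stack().stack().stack().rename('value').reset_index()
print(long_df)
```

Every Level1/Level2/Level3 combination exists in this frame, so the three stack calls produce one row per original cell (16 rows) without any missing values.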

I think the tricky part will be checking each of your files to work out what index_names should be.
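One possible way to automate that check is to read only the header rows without interpreting them and look at where the top row starts to be filled in. The sketch below is a heuristic, not part of the original answer: raw_text is a hypothetical CSV export laid out like the question's sheet, where merged header cells keep their value only in the first column they span, and n_header_rows is assumed to be known.

```python
from io import StringIO

import pandas

# Hypothetical CSV export of a sheet laid out like the question's.
raw_text = """\
,,,level1_1,,,,level1_2,,,
,,,level2_1,,level2_2,,level2_1,,level2_2,
a,b,c,level3_1,level3_2,level3_1,level3_2,level3_1,level3_2,level3_1,level3_2
aX,bX,cX,1,2,3,4,5,6,7,8
"""

n_header_rows = 3  # assumption: the number of header rows is known in advance
header = pandas.read_csv(StringIO(raw_text), header=None, nrows=n_header_rows)

# The id columns are taken to be the leading columns that are blank in the
# top header row; argmax returns the position of the first non-blank cell.
n_index_cols = int(header.iloc[0].notna().to_numpy().argmax())

# Their names sit in the last header row.
index_names = header.iloc[n_header_rows - 1, :n_index_cols].tolist()
print(index_names)  # ['a', 'b', 'c']
```

For a real workbook the same inspection could be done with pandas.read_excel(..., header=None, nrows=n_header_rows), but the result should still be checked by eye, since files with a different layout will defeat the heuristic.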

Regarding python - Tidying data from a multi-level Excel file with Pandas, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40319532/
