gpt4 book ai didi

python - 如何使用 Pandas 或 Spark Dataframe 展平嵌套的 Excel 数据?

转载 作者:行者123 更新时间:2023-12-04 08:04:48 25 4
gpt4 key购买 nike

我有以下 excel 数据,我想将其转换为下面提到的预期平面格式。在预期的格式中,我不想包含聚合列,例如差异和总计。请帮助我使用 python Pandas Dataframe 或 Pyspark Dataframe 完成它。
Excel 中的输入数据为:
enter image description here
预期的最终格式为
enter image description here
df 是我在输入数据上的 Pandas 数据框。下面是输出

df = pd.read_excel("D:/StackOverflow.xlsx",sheet_name = 'Input')
Data Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6
0 Type Domestic NaN NaN International NaN NaN
1 Unit/Subunit Jan Feb Diff Jan Feb Diff
2 NaN 2020 2021 NaN 2020 2021 NaN
3 Unit1 NaN NaN NaN NaN NaN NaN
4 SubUnit11 100 130 30 5000 8000 3000
5 SubUnit12 50 80 30 1000 4000 3000
6 SubUnit13 1000 1100 100 100000 100010 10
7 Total Unit1 1150 1310 160 106000 112010 6010
8 Unit2 NaN NaN NaN NaN NaN NaN
9 SubUnit21 1100 1130 30 15000 81100 66100
10 SubUnit22 150 180 30 11000 42000 31000
11 SubUnit23 11000 1110 -9890 1001000 1022010 21010
12 Total Unit2 12250 2420 -9830 1027000 1145110 118110

print (df.index.tolist()[:3])
[0, 1, 2]

print (df.columns.tolist()[:3])
['Data', 'Unnamed: 1', 'Unnamed: 2']

print (df.head(10).to_dict())

{'Data': {0: 'Type', 1: 'Unit/Subunit', 2: nan, 3: 'Unit1', 4: 'SubUnit11', 5: 'SubUnit12', 6: 'SubUnit13', 7: 'Total Unit1', 8: 'Unit2', 9: 'SubUnit21'}, 'Unnamed: 1': {0: 'Domestic ', 1: 'Jan', 2: 2020, 3: nan, 4: 100, 5: 50, 6: 1000, 7: 1150, 8: nan, 9: 1100}, 'Unnamed: 2': {0: nan, 1: 'Feb', 2: 2021, 3: nan, 4: 130, 5: 80, 6: 1100, 7: 1310, 8: nan, 9: 1130}, 'Unnamed: 3': {0: nan, 1: 'Diff', 2: nan, 3: nan, 4: 30, 5: 30, 6: 100, 7: 160, 8: nan, 9: 30}, 'Unnamed: 4': {0: 'International', 1: 'Jan', 2: 2020, 3: nan, 4: 5000, 5: 1000, 6: 100000, 7: 106000, 8: nan, 9: 15000}, 'Unnamed: 5': {0: nan, 1: 'Feb', 2: 2021, 3: nan, 4: 8000, 5: 4000, 6: 100010, 7: 112010, 8: nan, 9: 81100}, 'Unnamed: 6': {0: nan, 1: 'Diff', 2: nan, 3: nan, 4: 3000, 5: 3000, 6: 10, 7: 6010, 8: nan, 9: 66100}}


df1 = pd.read_excel("D:/StackOverflow.xlsx",sheet_name = 'Input',header = [1,2,3])

Type Domestic International
Unit/Subunit Jan Feb Diff Jan Feb Diff
Unnamed: 0_level_2 2020 2021 Unnamed: 3_level_2 2020 2021 Unnamed: 6_level_2
0 Unit1 NaN NaN NaN NaN NaN NaN
1 SubUnit11 100.0 130.0 30.0 5000.0 8000.0 3000.0
2 SubUnit12 50.0 80.0 30.0 1000.0 4000.0 3000.0
3 SubUnit13 1000.0 1100.0 100.0 100000.0 100010.0 10.0
4 Total Unit1 1150.0 1310.0 160.0 106000.0 112010.0 6010.0
5 Unit2 NaN NaN NaN NaN NaN NaN
6 SubUnit21 1100.0 1130.0 30.0 15000.0 81100.0 66100.0
7 SubUnit22 150.0 180.0 30.0 11000.0 42000.0 31000.0
8 SubUnit23 11000.0 1110.0 -9890.0 1001000.0 1022010.0 21010.0
9 Total Unit2 12250.0 2420.0 -9830.0 1027000.0 1145110.0 118110.0

最佳答案

您可以使用 DataFrame.stack reshape 然后删除不必要的行:

d ={('Domestic ', 'Jan', 2020): {'SubUnit11': 100.0, 'SubUnit12': 50.0, 'SubUnit13': 1000.0, 'Total Unit1': 1150.0, 'Unit2': np.nan, 'SubUnit21': 1100.0, 'SubUnit22': 150.0, 'SubUnit23': 11000.0, 'Total Unit2': 12250.0}, ('Domestic ', 'Feb', 2021): {'SubUnit11': 130.0, 'SubUnit12': 80.0, 'SubUnit13': 1100.0, 'Total Unit1': 1310.0, 'Unit2': np.nan, 'SubUnit21': 1130.0, 'SubUnit22': 180.0, 'SubUnit23': 1110.0, 'Total Unit2': 2420.0}, ('Domestic ', 'Diff', 'Unnamed: 3_level_2'):
{'SubUnit11': 30.0, 'SubUnit12': 30.0, 'SubUnit13': 100.0, 'Total Unit1': 160.0, 'Unit2': np.nan, 'SubUnit21': 30.0, 'SubUnit22': 30.0, 'SubUnit23': -9890.0, 'Total Unit2': -9830.0}, ('International', 'Jan', 2020): {'SubUnit11': 5000.0, 'SubUnit12': 1000.0, 'SubUnit13': 100000.0, 'Total Unit1': 106000.0, 'Unit2': np.nan, 'SubUnit21': 15000.0, 'SubUnit22': 11000.0, 'SubUnit23': 1001000.0, 'Total Unit2': 1027000.0},
('International', 'Feb', 2021): {'SubUnit11': 8000.0, 'SubUnit12': 4000.0, 'SubUnit13': 100010.0, 'Total Unit1': 112010.0, 'Unit2': np.nan, 'SubUnit21': 81100.0, 'SubUnit22': 42000.0, 'SubUnit23': 1022010.0, 'Total Unit2': 1145110.0}, ('International', 'Diff', 'Unnamed: 6_level_2'): {'SubUnit11': 3000.0, 'SubUnit12': 3000.0, 'SubUnit13': 10.0, 'Total Unit1': 6010.0, 'Unit2': np.nan, 'SubUnit21': 66100.0, 'SubUnit22': 31000.0, 'SubUnit23': 21010.0, 'Total Unit2': 118110.0}}

df = pd.DataFrame(d)

#added missing Unit1 in sample data
df.loc['Unit1'] = np.nan
df = df.loc[df.index[-1:].tolist() + df.index[:-1].tolist()]
# print (df)
#create new column Unit by index with repalce non matched Unit to NaNs
df['Unit'] = df.index.where(df.index.str.startswith('Unit'))
#forward flling NaNs
df['Unit'] = df['Unit'].ffill()

#add column to index and change order of levels in MultiIndex
df = df.set_index('Unit', append=True).swaplevel(1,0)

#reshape
df = df.stack([1, 2]).rename_axis(['Unit','SubUnit','Month','Year']).reset_index()

#remove unnecessary rows
df = df[df['Month'].ne('Diff') &
~df['Year'].str.startswith('Unnamed', na=False) &
~df['SubUnit'].str.startswith(('Total', 'Unit'), na=False)]
print (df)
Unit SubUnit Month Year Domestic International
2 Unit1 SubUnit11 Feb 2021 130.0 8000.0
3 Unit1 SubUnit11 Jan 2020 100.0 5000.0
6 Unit1 SubUnit12 Feb 2021 80.0 4000.0
7 Unit1 SubUnit12 Jan 2020 50.0 1000.0
10 Unit1 SubUnit13 Feb 2021 1100.0 100010.0
11 Unit1 SubUnit13 Jan 2020 1000.0 100000.0
18 Unit2 SubUnit21 Feb 2021 1130.0 81100.0
19 Unit2 SubUnit21 Jan 2020 1100.0 15000.0
22 Unit2 SubUnit22 Feb 2021 180.0 42000.0
23 Unit2 SubUnit22 Jan 2020 150.0 11000.0
26 Unit2 SubUnit23 Feb 2021 1110.0 1022010.0
27 Unit2 SubUnit23 Jan 2020 11000.0 1001000.0
最后是添加日期时间列进行排序:
df['Date'] = pd.to_datetime(df['Month'] + df['Year'].astype(str), format='%b%Y')

df = df.sort_values(['Unit','SubUnit','Date'], ignore_index=True)
print (df)
Unit SubUnit Month Year Domestic International Date
0 Unit1 SubUnit11 Jan 2020 100.0 5000.0 2020-01-01
1 Unit1 SubUnit11 Feb 2021 130.0 8000.0 2021-02-01
2 Unit1 SubUnit12 Jan 2020 50.0 1000.0 2020-01-01
3 Unit1 SubUnit12 Feb 2021 80.0 4000.0 2021-02-01
4 Unit1 SubUnit13 Jan 2020 1000.0 100000.0 2020-01-01
5 Unit1 SubUnit13 Feb 2021 1100.0 100010.0 2021-02-01
6 Unit2 SubUnit21 Jan 2020 1100.0 15000.0 2020-01-01
7 Unit2 SubUnit21 Feb 2021 1130.0 81100.0 2021-02-01
8 Unit2 SubUnit22 Jan 2020 150.0 11000.0 2020-01-01
9 Unit2 SubUnit22 Feb 2021 180.0 42000.0 2021-02-01
10 Unit2 SubUnit23 Jan 2020 11000.0 1001000.0 2020-01-01
11 Unit2 SubUnit23 Feb 2021 1110.0 1022010.0 2021-02-01

关于python - 如何使用 Pandas 或 Spark Dataframe 展平嵌套的 Excel 数据?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/66272366/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com