
python - Fill missing years per group over multiple columns and display multiple columns horizontally in order in Pandas

Reposted. Author: 太空宇宙. Updated: 2023-11-04 04:00:32

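For the dataframe below, I want to fill in the missing years (2015 through 2017) within each city and district group, then compute pct by grouping on city, district and year, and finally display the value and pct columns side by side (horizontally) for each year.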

  city district  value  year
0   sh        a      2  2015
1   sh        a      3  2016
2   sh        b      5  2015
3   sh        b      3  2016
4   bj        c      4  2015
5   bj        c      3  2017

What I have done so far:

1. Fill in the missing years, but this does not work yet:

rng = pd.date_range('2015', '2017', freq='YS').dt.year
df = df.apply(lambda x: x.reindex(rng, fill_value = 0))

2. Compute pct by grouping on city and district:

df['pct'] = df.sort_values('year').groupby(['city', 'district']).value.pct_change()

3. Display the value and pct columns horizontally, but the column order is not what I want:

df.pivot_table(columns='year', index=['city','district'], values=['value', 'pct'], fill_value='NaN').reset_index()

The output I get so far:

     city district   pct              value
year                2015 2016  2017  2015 2016 2017
0      bj        c   NaN  NaN -0.25   4.0  NaN    3
1      sh        a   NaN  0.5   NaN   2.0    3  NaN
2      sh        b   NaN -0.4   NaN   5.0    3  NaN

How can I get an expected result like this?

city  district  2015         2016         2017
                value  pct   value  pct   value  pct
bj    c         4                         3
sh    a         2            3      0.5
sh    b         5            3      -0.4


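Thanks.

Best answer

Use DataFrame.swaplevel together with DataFrame.sort_index to put year on the top column level. A reindex-based solution for filling the missing years is included as well: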

rng = pd.date_range('2015', '2017', freq='YS').year
c = df['city'].unique()
d = df['district'].unique()
mux = pd.MultiIndex.from_product([c, d, rng], names=['city','district','year'])

df = df.set_index(['city','district','year']).reindex(mux)

df['pct'] = df.sort_values('year').groupby(['city', 'district']).value.pct_change()

df = df.pivot_table(columns='year',
                    index=['city','district'],
                    values=['value', 'pct'],
                    fill_value='NaN')

df = df.swaplevel(0,1, axis=1).sort_index(axis=1, level=0)
print (df)
year           2015        2016        2017
                pct value   pct value   pct value
city district
bj   c          NaN   4.0   0.0   NaN -0.25     3
sh   a          NaN   2.0   0.5     3  0.00   NaN
     b          NaN   5.0  -0.4     3  0.00   NaN
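
The output above lists pct before value within each year because sort_index orders the inner column level alphabetically, whereas the question asks for value first. The following is a minimal sketch of one way to get that order, by reindexing the columns against an explicitly built MultiIndex. It is not the answer's method: the names years, wide_value, wide_pct, order and wide are illustrative, and pct is computed here by dividing each year by the immediately preceding calendar year, so a missing previous year gives NaN rather than the padded 0.0 and -0.25 shown above.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "city": ["sh", "sh", "sh", "sh", "bj", "bj"],
    "district": ["a", "a", "b", "b", "c", "c"],
    "value": [2, 3, 5, 3, 4, 3],
    "year": [2015, 2016, 2015, 2016, 2015, 2017],
})

years = [2015, 2016, 2017]  # assumed target range, taken from the question

# One row per (city, district), one column per year; missing years become NaN.
wide_value = (
    df.pivot_table(index=["city", "district"], columns="year", values="value")
      .reindex(columns=years)
)

# Year-over-year change against the previous calendar year (assumption:
# no padding, so a missing previous year yields NaN).
wide_pct = wide_value.div(wide_value.shift(axis=1)) - 1

# Stack the two blocks side by side, then move year to the top column level.
wide = pd.concat({"value": wide_value, "pct": wide_pct}, axis=1)
wide = wide.swaplevel(0, 1, axis=1)

# Explicit column order: for each year, value first and pct second.
order = pd.MultiIndex.from_product([years, ["value", "pct"]])
wide = wide.reindex(columns=order)

print(wide)
```

Building the column order explicitly avoids relying on alphabetical sorting, so it keeps working if more measures are added later.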

Edit: the error:

ValueError: cannot handle a non-unique multi-index!

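means that the combination of key columns, here ['city','district','year'], contains duplicate rows, so reindex cannot work on the resulting MultiIndex. The solution is to make each combination unique first, for example by aggregating value with mean: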

print (df)
#  city district  value  year
#0   sh        a      2  2015
#0   sh        a     20  2015
#1   sh        a      3  2016
#2   sh        b      5  2015
#3   sh        b      3  2016
#4   bj        c      4  2015
#5   bj        c      3  2017

rng = pd.date_range('2015', '2017', freq='YS').year
c = df['city'].unique()
d = df['district'].unique()
mux = pd.MultiIndex.from_product([c, d, rng], names=['city','district','year'])

print (df.groupby(['city','district','year'])['value'].mean())
city  district  year
bj    c         2015     4.0
                2017     3.0
sh    a         2015    11.0
                2016     3.0
      b         2015     5.0
                2016     3.0
Name: value, dtype: float64

df = df.groupby(['city','district','year'])['value'].mean().reindex(mux)

print (df)
#city  district  year
#sh    a         2015    11.0
#                2016     3.0
#                2017     NaN
#      b         2015     5.0
#                2016     3.0
#                2017     NaN
#      c         2015     NaN
#                2016     NaN
#                2017     NaN
#bj    a         2015     NaN
#                2016     NaN
#                2017     NaN
#      b         2015     NaN
#                2016     NaN
#                2017     NaN
#      c         2015     4.0
#                2016     NaN
#                2017     3.0
#Name: value, dtype: float64
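
Note that MultiIndex.from_product crosses every city with every district, so the result above also contains combinations such as sh/c and bj/a that never occur in the data. As a hedged sketch (not part of the original answer), the snippet below first lists the duplicated key rows, then builds the target index only from the (city, district) pairs that actually exist. The names keys, dupes, agg, pairs and filled are illustrative.

```python
import pandas as pd

# Sample data from the edit above, including the duplicated (sh, a, 2015) row.
df = pd.DataFrame({
    "city": ["sh", "sh", "sh", "sh", "sh", "bj", "bj"],
    "district": ["a", "a", "a", "b", "b", "c", "c"],
    "value": [2, 20, 3, 5, 3, 4, 3],
    "year": [2015, 2015, 2016, 2015, 2016, 2015, 2017],
})
keys = ["city", "district", "year"]

# Rows whose key combination appears more than once; these are what make
# the MultiIndex non-unique.
dupes = df[df.duplicated(keys, keep=False)]
print(dupes)

# Collapse duplicates to one value per key (mean, as in the answer).
agg = df.groupby(keys)["value"].mean()

# Cross only the existing (city, district) pairs with every year.
years = [2015, 2016, 2017]  # assumed target range, taken from the question
pairs = df[["city", "district"]].drop_duplicates()
mux = pd.MultiIndex.from_tuples(
    [(c, d, y) for c, d in pairs.itertuples(index=False) for y in years],
    names=keys,
)

filled = agg.reindex(mux)
print(filled)
```

With this index, the reindexed Series has nine rows (three existing pairs times three years) instead of eighteen.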

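Regarding "python - Fill missing years per group over multiple columns and display multiple columns horizontally in order in Pandas", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58446571/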
