
python - Summing over multi-index Pandas columns

Reposted. Author: 太空宇宙. Updated: 2023-11-04 02:48:31

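I want to create a DataFrame in which both the columns (year, quarter, month) and the index (some attributes) are hierarchical, i.e. MultiIndexes. I want to sum over certain levels, for example add up all the months that belong to one quarter. In Pandas this can be done with a line such as the following: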

# Axis 1 = columns, level 0 = year, level 1 = quarter
df.sum(axis=1, level=[0, 1])

This worked until, in some odd cases, the index was no longer recognized correctly, triggering the error message No axis named 1 for object type <class 'pandas.core.series.Series'>.

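In the code below I create two identical DataFrames (MultiIndex on both axes) with a single difference: df1 is created empty, while df2 is created already filled with ones. The sum works for df2 but not for df1. I do not understand what is happening behind the scenes. Can someone point me to an explanation of this difference?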

import pandas as pd
import numpy as np

cols = [(y, divmod(m - 1, 3)[0] + 1, m)
        for y in list(range(2011, 2014)) for m in list(range(1, 13))]

inds = [(a, b, c)
        for a in ["a1", "a2"] for b in ["b1", "b2"] for c in ["c1", "c2"]]

df1 = pd.DataFrame(index=pd.MultiIndex.from_tuples(inds, names=["a", "b", "c"]),
                   columns=pd.MultiIndex.from_tuples(cols, names=["year", "quarter", "month"]))

df2 = pd.DataFrame(np.ones(df1.shape),
                   index=pd.MultiIndex.from_tuples(inds, names=["a", "b", "c"]),
                   columns=pd.MultiIndex.from_tuples(cols, names=["year", "quarter", "month"]))

for (col, ind) in [(col, ind) for ind in df1.index.values for col in df1.columns.values]:
    entry = np.random.rand()
    df1.loc[ind, col] = entry
    df2.loc[ind, col] = entry

try:
    df1.sum(axis=1, level=[0, 1])
    print("Sum over df1 did work")
except:
    print("Sum over df1 did not work...")

try:
    df2.sum(axis=1, level=[0, 1])
    print("Sum over df2 did work")
except:
    print("Sum over df2 did not work...")

PS: I found one hint: the entries in df1 are of type float, while in df2 they are np.float64, but that still does not help...

Best answer

The problem is that all values in df1 have dtype object. That usually indicates strings, but here the elements are <class 'float'>:

print (df1.dtypes)
year  quarter  month
2011  1        1        object
               2        object
               3        object
      2        4        object
               5        object
               6        object
      3        7        object
               8        object
               9        object
      4        10       object

print (df2.dtypes)
year  quarter  month
2011  1        1        float64
               2        float64
               3        float64
      2        4        float64
               5        float64
               6        float64
      3        7        float64
               8        float64
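
To make the dtype difference concrete, here is a minimal sketch on a hypothetical two-by-two frame (the names `empty`, `from_array`, `r1`, `c1` and so on are illustrative and not from the question). It assumes, as the output above suggests, that a DataFrame created with only an index and columns starts out with object columns and that cell-by-cell assignment does not change that:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: labels only, no data supplied.
empty = pd.DataFrame(index=["r1", "r2"], columns=["c1", "c2"])
initial_dtypes = empty.dtypes.tolist()

# Filling cell by cell stores plain Python floats; the column dtype is not re-inferred.
for row in empty.index:
    for col in empty.columns:
        empty.loc[row, col] = np.random.rand()
filled_dtypes = empty.dtypes.tolist()

# A frame built from a numeric array gets a numeric dtype from the start.
from_array = pd.DataFrame(np.ones((2, 2)), index=["r1", "r2"], columns=["c1", "c2"])

# An explicit cast turns the object columns into float64 columns.
converted = empty.astype(float)

print(initial_dtypes, filled_dtypes)
print(from_array.dtypes.tolist(), converted.dtypes.tolist())
```

Checking `df.dtypes` before aggregating is a cheap way to catch this situation early.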

So casting the dtype works:

try:
    df1.astype(float).sum(axis=1, level=[0, 1])
    print("Sum over df1 did work")
except:
    print("Sum over df1 did not work...")

try:
    df2.sum(axis=1, level=[0, 1])
    print("Sum over df2 did work")
except:
    print("Sum over df2 did not work...")
Sum over df1 did work
Sum over df2 did work
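
A note for readers on newer pandas: the `level` argument of `DataFrame.sum` was deprecated in pandas 1.3 and removed in 2.0, so the calls above only run on older versions. The sketch below shows one possible equivalent using `groupby` on the transposed frame. The small frame `df` and the name `quarterly` are made up for illustration and are not part of the original answer:

```python
import numpy as np
import pandas as pd

# Hypothetical columns: two years of months, grouped into quarters.
cols = pd.MultiIndex.from_tuples(
    [(y, (m - 1) // 3 + 1, m) for y in (2011, 2012) for m in range(1, 13)],
    names=["year", "quarter", "month"],
)
df = pd.DataFrame(np.ones((2, len(cols))), index=["a1", "a2"], columns=cols)

# Transpose so the column levels become row levels, group by year and quarter,
# sum the months in each group, then transpose back.
quarterly = df.T.groupby(level=["year", "quarter"]).sum().T

print(quarterly.shape)  # (2, 8): two rows, four quarters for each of two years
```

The same dtype caveat applies here: the frame should hold numeric columns (for example after `astype(float)`) before it is grouped and summed.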

for (col, ind) in [(col, ind) for ind in df1.index.values for col in df1.columns.values]:
    entry = np.random.rand()
    df1.loc[ind, col] = entry
    print (type(df1.loc[ind, col]))
    df2.loc[ind, col] = entry
    print (type(df2.loc[ind, col]))

<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>

The best approach is to create the DataFrame from a numpy array; then everything works:

df1 = pd.DataFrame(data=np.random.rand(len(inds), len(cols)),
                   index=pd.MultiIndex.from_tuples(inds, names=["a", "b", "c"]),
                   columns=pd.MultiIndex.from_tuples(cols, names=["year", "quarter", "month"]))


try:
    df1.sum(axis=1, level=[0, 1])
    print("Sum over df1 did work")
except:
    print("Sum over df1 did not work...")
Sum over df1 did work
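
If the values really have to be filled in one cell at a time, a possible variation on the same idea is to preallocate the frame with NaN so the columns are float64 from the start. This is only a sketch under that assumption; the names `prealloc`, `index` and `columns` and the reduced label set are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical small index and columns, only for illustration.
index = pd.MultiIndex.from_product([["a1", "a2"], ["b1", "b2"]], names=["a", "b"])
columns = pd.MultiIndex.from_tuples(
    [(2011, 1, 1), (2011, 1, 2), (2011, 2, 4)],
    names=["year", "quarter", "month"],
)

# Broadcasting a NaN scalar gives float64 columns instead of object columns.
prealloc = pd.DataFrame(np.nan, index=index, columns=columns)

# Cell-by-cell assignment now writes into float64 storage.
for ind in prealloc.index:
    for col in prealloc.columns:
        prealloc.loc[ind, col] = np.random.rand()

print(prealloc.dtypes)
```

Building the whole array first, as shown above, is still simpler and faster; preallocation only helps when the values arrive incrementally.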

Regarding python - Summing over multi-index Pandas columns, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/44514183/
