
python - Computing aggregate values in a pandas DataFrame with multi-level columns

Reposted. Author: 行者123. Updated: 2023-12-01 03:08:21

I have a pandas DataFrame with multi-level columns.

import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)
print(df)

first bar baz foo qux \
second one two one two one two one
A -0.093829 -0.159939 -0.386961 -0.367417 0.625646 1.286186 0.429855
B 0.440266 0.345161 1.798363 -1.265215 0.204303 -1.492993 -1.714360
C 0.689076 -1.211060 -0.265888 0.769467 -0.706941 0.086907 -0.892892

first
second two
A -1.006210
B -0.275578
C -0.563757

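I want to compute the mean and standard deviation of each column, grouped by the upper column level. Once I have the mean and standard deviation, I want to double the lower-level columns, adding the statistic (mean or std) to the column name, as in "column name" + "_" + "std"/"mean".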
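To make the target naming concrete, here is a minimal sketch for a single column. The frame `toy` and the dictionary `wanted` are illustrative names that do not appear in the question, and the data is a small deterministic array rather than the random values used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same two column levels as in the question
cols = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                  names=['first', 'second'])
toy = pd.DataFrame(np.arange(12, dtype=float).reshape(3, 4),
                   index=['A', 'B', 'C'], columns=cols)

# One column under the upper-level label 'bar'; its values are 0, 4, 8
col = toy[('bar', 'one')]

# Build the new lower-level labels: "column name" + "_" + statistic
wanted = {
    ('bar', 'one' + '_' + 'mean'): float(col.mean()),
    ('bar', 'one' + '_' + 'std'): float(col.std()),  # sample std (ddof=1)
}
print(wanted)
# {('bar', 'one_mean'): 4.0, ('bar', 'one_std'): 4.0}
```

Each original column therefore yields two columns under the same upper-level label, which is the "doubling" described above.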

from itertools import product

# Note: groupby(..., axis=1) is deprecated in recent pandas releases (2.1+).
group_cols = df.groupby(df.columns.get_level_values('first'), axis=1)
list_stat_dfs = []
for key, group in group_cols:
    group_descr = group.describe().loc[['mean', 'std'], :]  # Get mean and std from single site
    group_descr.loc[:, (key, 'stats')] = group_descr.index
    group_descr.loc[:, (key, 'first')] = key
    group_descr.columns = group_descr.columns.droplevel(0)  # Remove upper level column (site_name)
    group_descr = group_descr.pivot(columns='stats', index='first')  # Rows to columns
    col_prod = list(product(group_descr.columns.levels[0], group_descr.columns.levels[1]))
    cols = ['_'.join((col[0], col[1])) for col in col_prod]
    group_descr.columns = pd.MultiIndex.from_product(([key], cols))  # From multiple columns to single column
    group_descr.reset_index(inplace=True)
    list_stat_dfs.append(group_descr)

group_descr = pd.concat(list_stat_dfs, axis=1)
print(group_descr)

first bar first baz \
one_mean one_std two_mean two_std one_mean one_std
0 bar 0.507185 1.799053 -0.249692 1.41507 baz -0.147664 0.595927

first foo first \
two_mean two_std one_mean one_std two_mean two_std
0 0.160018 1.405113 foo -0.433644 1.245972 0.254995 0.846983 qux

qux
one_mean one_std two_mean two_std
0 0.667629 0.315417 -0.757989 0.683273

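As you can see, I managed this with a for loop and several lines of code. Can someone do the same thing in a more optimized way? I'm fairly sure that with pandas the same thing can be done in just a few lines.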

Best answer

I think you need to compute the mean and std of df, then concat them together and reshape with unstack:

import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))

np.random.seed(1000)
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)
print(df)
first bar baz foo qux \
second one two one two one two one
A -0.804458 0.320932 -0.025483 0.644324 -0.300797 0.389475 -0.107437
B 0.595036 -0.464668 0.667281 -0.806116 -1.196070 -0.405960 -0.182377
C -0.138422 0.705692 1.271795 -0.986747 -0.334835 -0.099482 0.407192

first
second two
A -0.479983
B 0.103193
C 0.919388

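# Stack the per-column mean and std, then move the 'first' level into the columns
df = pd.concat([df.mean(), df.std()], keys=('mean','std')).unstack(1)
# Relabel the rows as "<second>_<stat>" under a single dummy row key 0
df.index = [[0] * len(df.index), ['_'.join((col[1], col[0])) for col in df.index]]
# Move the new labels into the columns, leaving a single row
df = df.unstack()
print(df)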
first bar baz \
one_mean one_std two_mean two_std one_mean one_std two_mean
0 -0.115948 0.700018 0.187319 0.596511 0.637865 0.649139 -0.382846

first foo qux \
two_std one_mean one_std two_mean two_std one_mean one_std
0 0.894129 -0.610567 0.507346 -0.038656 0.401191 0.039126 0.32095

first
two_mean two_std
0 0.180866 0.702911
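As a possible alternative to the concat/unstack approach, the same one-row table can be sketched with `DataFrame.agg` and a flattened Series. This is only one variant under the same setup, not the answer's method; the names `stats`, `flat`, `new_cols`, and `result` are illustrative.

```python
import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])

np.random.seed(1000)
df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)

# Two rows ('mean', 'std'), same multi-level columns as df
stats = df.agg(['mean', 'std'])

# Flatten to a Series indexed by (first, second, stat)
flat = stats.unstack()

# Merge the lower level and the statistic into one label, e.g. 'one_mean'
new_cols = pd.MultiIndex.from_tuples(
    [(first, f'{second}_{stat}') for first, second, stat in flat.index],
    names=['first', None],
)

# Single-row frame with the combined column labels
result = pd.DataFrame([flat.to_numpy()], columns=new_cols)
print(result)
```

Because the labels are built from the flattened index directly, the column order follows the original columns (mean first, then std for each), with no reliance on the sorting that `unstack` performs.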

Regarding python - Computing aggregate values in a pandas DataFrame with multi-level columns, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/43112514/
