
python - How to perform calculations on specific rows and columns of a crosstab in pandas?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 18:49:43

import pandas as pd
import numpy as np

c1 = np.repeat(['a','b'], [50, 50], axis=0)
c2 = list('xy'*50)
c3 = np.repeat(['G1','G2'], [50, 50], axis=0)
np.random.shuffle(c3)
c4=np.repeat([1,2], [50,50],axis=0)
np.random.shuffle(c4)
val = np.random.rand(100)

df = pd.DataFrame({'c1':c1, 'c2':c2, 'c3':c3, 'c4':c4, 'val':val})

table = pd.crosstab([df.c1,df.c2],[df.c3,df.c4])
c3     G1      G2
c4      1   2   1   2
c1 c2
a  x    3  11   5   6
   y    9   5   7   4
b  x    5   7  11   2
   y    5   5   5  10

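For each group (G1, G2), is it possible to compute ax - bx and ay - by only for c4 == 2, and save the result in a DataFrame? For the table above, the result would be:

x  G1   4
y  G1   0
x  G2   4
y  G2  -6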

Edit: how would I do this if df were in the following format?

c1 = np.repeat(['a','b'], [8, 8], axis=0)
c2 = list('xxxxyyyyxxxxyyyy')
c3 = ['G1','G1','G2','G2','G1','G1','G2','G2','G1','G1','G2','G2','G1','G1','G2','G2']
c4 = [1,2]*8
val = np.random.rand(16)
df = pd.DataFrame({'c1':c1,'c2':c2,'c3':c3,'c4':c4,'val':val})

Best answer

You can do it like this:

In [6]: table
Out[6]:
c3     G1      G2
c4      1   2   1   2
c1 c2
a  x    6   5   8   6
   y    9   4   5   7
b  x    5  10   4   6
   y    7   4   6   8

In [7]: g = table.xs(2, level='c4', axis=1)

In [8]: g
Out[8]:
c3     G1  G2
c1 c2
a  x    5   6
   y    4   7
b  x   10   6
   y    4   8

In [9]: g.groupby(level='c2').apply(lambda x: x.iloc[0] - x.iloc[1])
Out[9]:
c3  G1  G2
c2
x   -5   0
y    0  -1

Alternatively, pass as_index=False to groupby and use loc in the lambda, which IMHO is a bit nicer because indexing by name rather than by integer position makes more sense:

In [11]: g.groupby(level='c2', as_index=False).apply(lambda x: x.loc['a'] - x.loc['b'])
Out[11]:
c3  G1  G2
c2
x   -5   0
y    0  -1

Note that using as_index=False with apply only works on pandas git master. If you are not using master, you will get the following result instead:

In [12]: r = g.groupby(level='c2').apply(lambda x: x.loc['a'] - x.loc['b'])

In [13]: r
Out[13]:
c3     G1  G2
c2 c2
x  x   -5   0
y  y    0  -1

You can remove the duplicated index level by reassigning the index attribute of r:

In [28]: r.index = r.index.droplevel(0)

In [29]: r
Out[29]:
c3  G1  G2
c2
x   -5   0
y    0  -1
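As a reproducible sketch of the crosstab steps above, the block below rebuilds the table from Out[6] by hand (the original came from shuffled random data) and computes the same a - b difference. Note that it swaps the groupby/apply step for two label-based xs selections on the rows; that is an alternative not used in the answer, and the name diff is only illustrative.

```python
import pandas as pd

# Rebuild the crosstab shown in Out[6] by hand so the example is reproducible.
rows = pd.MultiIndex.from_product([['a', 'b'], ['x', 'y']], names=['c1', 'c2'])
cols = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2]], names=['c3', 'c4'])
table = pd.DataFrame(
    [[6, 5, 8, 6],
     [9, 4, 5, 7],
     [5, 10, 4, 6],
     [7, 4, 6, 8]],
    index=rows, columns=cols,
)

# Keep only the c4 == 2 columns; the c4 level is dropped from the columns.
g = table.xs(2, level='c4', axis=1)

# Select the 'a' rows and the 'b' rows by label. Each selection is indexed
# by c2 only, so the subtraction aligns x with x and y with y.
diff = g.xs('a', level='c1') - g.xs('b', level='c1')
print(diff)
```

Because both operands are selected by label, the result does not depend on 'a' happening to come before 'b' within each group, and no duplicated index level needs to be dropped afterwards.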

Edit: if you have a "melted" DataFrame, do the following:

In [28]: df
Out[28]:
   c1 c2  c3  c4    val
0   a  x  G1   1  0.244
1   a  x  G1   2  0.572
2   a  x  G2   1  0.837
3   a  x  G2   2  0.893
4   a  y  G1   1  0.951
5   a  y  G1   2  0.400
6   a  y  G2   1  0.391
7   a  y  G2   2  0.237
8   b  x  G1   1  0.904
9   b  x  G1   2  0.811
10  b  x  G2   1  0.536
11  b  x  G2   2  0.736
12  b  y  G1   1  0.546
13  b  y  G1   2  0.159
14  b  y  G2   1  0.735
15  b  y  G2   2  0.772

In [29]: g2 = df[df.c4 == 2]

In [30]: g2
Out[30]:
   c1 c2  c3  c4    val
1   a  x  G1   2  0.572
3   a  x  G2   2  0.893
5   a  y  G1   2  0.400
7   a  y  G2   2  0.237
9   b  x  G1   2  0.811
11  b  x  G2   2  0.736
13  b  y  G1   2  0.159
15  b  y  G2   2  0.772

In [31]: gb = g2.groupby(['c2', 'c3'])

In [32]: sub = gb.apply(lambda x: x.val.iloc[0] - x.val.iloc[1])

In [33]: sub
Out[33]:
c2  c3
x   G1   -0.239
    G2    0.157
y   G1    0.241
    G2   -0.535
dtype: float64

In [34]: sub.unstack()
Out[34]:
c3     G1     G2
c2
x  -0.239  0.157
y   0.241 -0.535
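The iloc[0] - iloc[1] step relies on the 'a' row coming before the 'b' row inside every group. A possible variant that subtracts by label instead is sketched below; it assumes exactly one row per (c1, c2, c3) combination once c4 == 2 is selected, and the names wide and result are illustrative only. The frame is typed in from Out[28] so the numbers can be checked.

```python
import pandas as pd

# The 16-row "melted" frame from Out[28], entered by hand for reproducibility.
df = pd.DataFrame({
    'c1': ['a'] * 8 + ['b'] * 8,
    'c2': list('xxxxyyyy' * 2),
    'c3': ['G1', 'G1', 'G2', 'G2'] * 4,
    'c4': [1, 2] * 8,
    'val': [0.244, 0.572, 0.837, 0.893, 0.951, 0.400, 0.391, 0.237,
            0.904, 0.811, 0.536, 0.736, 0.546, 0.159, 0.735, 0.772],
})

g2 = df[df.c4 == 2]

# Move c1 into the columns so 'a' and 'b' sit side by side for each (c2, c3).
# unstack raises if a (c2, c3, c1) combination occurs more than once.
wide = g2.set_index(['c2', 'c3', 'c1'])['val'].unstack('c1')

# Subtract by label, then spread c3 across the columns as in Out[34].
result = (wide['a'] - wide['b']).unstack('c3')
print(result.round(3))
```

If the data could contain several rows per combination, an aggregation (for example a groupby sum before reshaping) would be needed first; that case is not covered by the original answer.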

Whenever I'm unsure what the groups in a groupby operation look like, I iterate over the groupby and print out its pieces:

In [40]: for _, x in g2.groupby(['c2', 'c3']):
   ....:     print(x)
   ....:     print()
   ....:
  c1 c2  c3  c4    val
1  a  x  G1   2  0.572
9  b  x  G1   2  0.811

   c1 c2  c3  c4    val
3   a  x  G2   2  0.893
11  b  x  G2   2  0.736

   c1 c2  c3  c4    val
5   a  y  G1   2  0.400
13  b  y  G1   2  0.159

   c1 c2  c3  c4    val
7   a  y  G2   2  0.237
15  b  y  G2   2  0.772

Each of these pieces is the x in the lambda x: ... that is passed to groupby.apply().
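A self-contained Python 3 sketch of the same inspection is given below. It rebuilds g2 from Out[30] by hand and also keeps the group key, which the loop in the answer discards; the groups dictionary is purely illustrative.

```python
import pandas as pd

# g2 as shown in Out[30], entered by hand with its original row labels.
g2 = pd.DataFrame(
    {'c1': list('aaaabbbb'),
     'c2': list('xxyyxxyy'),
     'c3': ['G1', 'G2'] * 4,
     'c4': [2] * 8,
     'val': [0.572, 0.893, 0.400, 0.237, 0.811, 0.736, 0.159, 0.772]},
    index=[1, 3, 5, 7, 9, 11, 13, 15],
)

# Iterating a groupby yields (key, sub-frame) pairs; with two grouping
# columns the key is a (c2, c3) tuple, and keys come out sorted by default.
groups = {}
for key, x in g2.groupby(['c2', 'c3']):
    groups[key] = x
    print(key)
    print(x)
    print()
```

Printing the key next to each piece makes it easier to see which (c2, c3) pair a given sub-frame belongs to before writing the function for apply.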

Regarding "python - How to perform calculations on specific rows and columns of a crosstab in pandas?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/18647647/
