python - Chaining grouping, filtering and aggregation

Reposted. Author: 太空狗. Updated: 2023-10-29 22:29:27

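The DataFrameGroupBy.filter method filters groups and returns a DataFrame containing the rows of the groups that passed the filter.

But how can I get a new DataFrameGroupBy object, rather than a DataFrame, after filtering?

For example, say I have a DataFrame df with two columns, A and B. I want the mean of column B for each value of column A, but only for groups that contain at least 5 rows: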

# pandas 0.18.0
# doesn't work because `filter` returns a DF not a GroupBy object
df.groupby('A').filter(lambda x: len(x)>=5).mean()
# works but slower and awkward to write because needs to groupby('A') twice
df.groupby('A').filter(lambda x: len(x)>=5).reset_index().groupby('A').mean()
# works but more verbose than chaining
groups = df.groupby('A')
groups.mean()[groups.size() >= 5]
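
To make the problem concrete, here is a minimal sketch on made-up data (the values and group labels are hypothetical, not from the original post). It shows that filter() gives back a plain DataFrame, which is why a second groupby is needed before taking per-group means:

```python
import pandas as pd

# Hypothetical data: group 'x' has 5 rows, group 'y' has only 2.
df = pd.DataFrame({
    "A": ["x"] * 5 + ["y"] * 2,
    "B": [1, 2, 3, 4, 5, 10, 20],
})

filtered = df.groupby("A").filter(lambda g: len(g) >= 5)

# filter() hands back the surviving rows as a plain DataFrame,
# so the grouping is gone and has to be re-established.
print(type(filtered).__name__)  # DataFrame

per_group_mean = filtered.groupby("A")["B"].mean()
print(per_group_mean)
```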

Best answer

You can do it this way:

In [310]: df
Out[310]:
    a  b
0   1  4
1   7  3
2   6  9
3   4  4
4   0  2
5   8  4
6   7  7
7   0  5
8   8  5
9   8  7
10  6  1
11  3  8
12  7  4
13  8  0
14  5  3
15  5  3
16  8  1
17  7  2
18  9  9
19  3  2
20  9  1
21  1  2
22  0  3
23  8  9
24  7  7
25  8  1
26  5  8
27  9  6
28  2  8
29  9  0

In [314]: r = df.groupby('a').apply(lambda x: x.b.mean() if len(x)>=5 else -1)

In [315]: r
Out[315]:
a
0   -1.000000
1   -1.000000
2   -1.000000
3   -1.000000
4   -1.000000
5   -1.000000
6   -1.000000
7    4.600000
8    3.857143
9   -1.000000
dtype: float64

In [316]: r[r>0]
Out[316]:
a
7    4.600000
8    3.857143
dtype: float64
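
One caveat with the -1 sentinel: selecting with r > 0 also drops any large-enough group whose mean happens to be zero or negative. A possible variation, sketched here on hypothetical data and not part of the original answer, is to return NaN for the small groups and drop them with dropna(). The helper name mean_if_large and its min_rows parameter are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: group 1 has 5 rows with a mean of exactly 0,
# group 2 has 5 rows with a positive mean, group 3 is too small.
df = pd.DataFrame({
    "a": [1] * 5 + [2] * 5 + [3] * 2,
    "b": [-2, -1, 0, 1, 2, 1, 2, 3, 4, 5, 7, 9],
})

def mean_if_large(group, min_rows=5):
    # Return NaN (not -1) for groups that are too small.
    return group.mean() if len(group) >= min_rows else np.nan

# Selecting column 'b' first means each group arrives as a Series.
r = df.groupby("a")["b"].apply(mean_if_large).dropna()
print(r)
```

Here group 1 is kept even though its mean is 0, whereas the -1 sentinel combined with r > 0 would have discarded it.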

As a one-liner that returns a DataFrame instead of a Series:

df.groupby('a') \
  .apply(lambda x: x.b.mean() if len(x)>=5 else -1) \
  .to_frame() \
  .rename(columns={0:'mean'}) \
  .query('mean > 0')

Timing comparison on a DataFrame with 100,000 rows:

def maxu():
    r = df.groupby('a').apply(lambda x: x.b.mean() if len(x)>=5 else -1)
    return r[r>0]

def maxu2():
    return df.groupby('a') \
             .apply(lambda x: x.b.mean() if len(x)>=5 else -1) \
             .to_frame() \
             .rename(columns={0:'mean'}) \
             .query('mean > 0')

def alexander():
    return df.groupby('a', as_index=False).filter(lambda group: group.a.count() >= 5).groupby('a').mean()

def alexander2():
    vc = df.a.value_counts()
    return df.loc[df.a.isin(vc[vc >= 5].index)].groupby('a').mean()

Results:

In [419]: %timeit maxu()
1 loop, best of 3: 1.12 s per loop

In [420]: %timeit maxu2()
1 loop, best of 3: 1.12 s per loop

In [421]: %timeit alexander()
1 loop, best of 3: 34.9 s per loop

In [422]: %timeit alexander2()
10 loops, best of 3: 66.6 ms per loop

Check:

In [423]: alexander2().sum()
Out[423]:
b 19220943.162
dtype: float64

In [424]: maxu2().sum()
Out[424]:
mean 19220943.162
dtype: float64
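
The post does not show how the 100,000-row frame was built, so the comparison cannot be reproduced exactly. The sketch below uses a synthetic frame instead: the seed, the number of distinct keys and the value range are all assumptions, and the function names via_apply, via_value_counts and timed are hypothetical stand-ins for the apply-based and value_counts-based approaches. It also compares the two results group by group, which is a stricter check than comparing column sums:

```python
import time

import numpy as np
import pandas as pd

# Assumption: synthetic data, since the original frame is not shown.
rng = np.random.default_rng(0)
n_rows = 100_000
df = pd.DataFrame({
    "a": rng.integers(0, 10_000, size=n_rows),
    "b": rng.random(n_rows) * 100 + 1,  # strictly positive, so `r > 0` is safe
})

def via_apply(frame):
    # Per-group Python callback with a -1 sentinel for small groups.
    r = frame.groupby("a")["b"].apply(lambda s: s.mean() if len(s) >= 5 else -1)
    return r[r > 0]

def via_value_counts(frame):
    # Pick the large groups first, then aggregate only their rows.
    vc = frame["a"].value_counts()
    keep = vc[vc >= 5].index
    return frame.loc[frame["a"].isin(keep)].groupby("a")["b"].mean()

def timed(fn, frame):
    start = time.perf_counter()
    result = fn(frame)
    return result, time.perf_counter() - start

slow, slow_seconds = timed(via_apply, df)
fast, fast_seconds = timed(via_value_counts, df)

# Compare group by group instead of comparing only the sums.
pd.testing.assert_series_equal(slow.sort_index(), fast.sort_index(), check_names=False)

print(f"via_apply:        {slow_seconds:.3f} s")
print(f"via_value_counts: {fast_seconds:.3f} s")
```

Absolute timings will differ by machine and pandas version, so only the relative ordering is worth reading from a run like this.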

Conclusion:

The clear winner is the alexander2() function.

@Alexander, congratulations!
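
Coming back to the original wish to stay within a single chain: one possible alternative, not taken from the answer above and assuming pandas 0.25 or later for named aggregation, is to compute the mean and the group size in one groupby pass and then select on the size. The data and the column names mean and n are hypothetical:

```python
import pandas as pd

# Hypothetical data: group 7 has 5 rows, group 8 has 6, group 9 has 2.
df = pd.DataFrame({
    "a": [7] * 5 + [8] * 6 + [9] * 2,
    "b": [3, 7, 4, 2, 7, 4, 5, 7, 0, 1, 1, 9, 1],
})

result = (
    df.groupby("a")["b"]
      .agg(mean="mean", n="size")          # mean and row count per group
      .loc[lambda t: t["n"] >= 5, "mean"]  # keep only groups with >= 5 rows
)
print(result)
```

This was not part of the timing comparison above, so no claim is made about how it ranks against alexander2().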

Regarding python - chaining grouping, filtering and aggregation, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/36389919/