
python - Pandas: top N rows plus the total of the rest, for each group

Reposted. Author: 行者123. Updated: 2023-12-01 08:41:00

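I have a DataFrame containing country, region, city, product, and sales (in dollars). For each country, region, and city, I need the top 3 products by sales, with all remaining products collected under "Other", together with the associated sales and units.

The end result is the top 3 products + "Other" for each combination of country, region, and city.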

Country  Region  City   Product  Sales  Val
Europe   Italy   Milan  Ring       100   10
Europe   Italy   Milan  Book       300    5
Europe   Italy   Milan  Phone     1500   10
Europe   Italy   Milan  Car        200    5
Europe   Italy   Milan  Ring       100   10
Europe   Italy   Milan  Pen       8200    5

Expected result with the top 3:

Country  Region  City   Product  Sales  Val
Europe   Italy   Milan  Pen       8200    5
Europe   Italy   Milan  Phone     1500   10
Europe   Italy   Milan  Book       300    5
Europe   Italy   Milan  Other      400   25
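
To experiment with the question's data, the sample table can be rebuilt as a DataFrame. This is only a sketch for reproduction: the variable name `rest` is introduced here for illustration, and the final lines simply confirm the expected "Other" row by hand, assuming Val holds the units.

```python
import pandas as pd

# Sample data from the question (Sales in dollars, Val assumed to be units).
df = pd.DataFrame({
    "Country": ["Europe"] * 6,
    "Region": ["Italy"] * 6,
    "City": ["Milan"] * 6,
    "Product": ["Ring", "Book", "Phone", "Car", "Ring", "Pen"],
    "Sales": [100, 300, 1500, 200, 100, 8200],
    "Val": [10, 5, 10, 5, 10, 5],
})

# The three largest Sales values belong to Pen, Phone and Book;
# every other row should be folded into the "Other" row.
rest = df[~df["Product"].isin(["Pen", "Phone", "Book"])]
print(rest[["Sales", "Val"]].sum())
```

The remaining rows are Ring, Car, and Ring, so Sales adds up to 100 + 200 + 100 = 400 and Val to 10 + 5 + 10 = 25, which matches the "Other" row in the expected result.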

Best answer

First, create a default index with reset_index, so that every row has a unique label:

import pandas as pd

df = df.reset_index(drop=True)

Then sort by the Sales column with sort_values and use GroupBy.head to get the top 3 rows of each group:

cols = ['Country', 'Region', 'City']
# sort by Sales (largest first), then keep the first 3 rows of each group
df1 = df.sort_values('Sales', ascending=False).groupby(cols).head(3)
print(df1)
  Country Region   City Product  Sales  Val
5  Europe  Italy  Milan     Pen   8200    5
2  Europe  Italy  Milan   Phone   1500   10
1  Europe  Italy  Milan    Book    300    5

Then select the rows that were not used for the top 3 and aggregate them with sum:

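# rows that are not among the top 3 of their group
df2 = df.loc[df.index.difference(df1.index)]
# sum only the numeric columns, then label the aggregated row as 'Other'
df2 = df2.groupby(cols, as_index=False)[['Sales', 'Val']].sum().assign(Product='Other')
print(df2)
  Country Region   City  Sales  Val Product
0  Europe  Italy  Milan    400   25   Other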

Finally, join the two parts with concat:

# the column order of the printed result may differ between pandas versions
df = pd.concat([df1, df2]).sort_values(cols).reset_index(drop=True)
print(df)
    City Country Product Region  Sales  Val
0  Milan  Europe     Pen  Italy   8200    5
1  Milan  Europe   Phone  Italy   1500   10
2  Milan  Europe    Book  Italy    300    5
3  Milan  Europe   Other  Italy    400   25
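
The three steps above can be wrapped into one reusable function. This is a minimal sketch, not part of the original answer: the function name `top_n_with_other` and all of its parameters are hypothetical, and the sample data is the table from the question. It uses a stable sort so that rows with equal Sales keep their original order.

```python
import pandas as pd


def top_n_with_other(df, group_cols, n=3, sort_col="Sales",
                     label_col="Product", other_label="Other"):
    """Keep the n largest rows per group and sum the rest into one row."""
    # Unique row labels are needed for index.difference below.
    df = df.reset_index(drop=True)
    group_cols = list(group_cols)
    # Every column that is neither a key nor the label is summed.
    value_cols = [c for c in df.columns
                  if c not in group_cols and c != label_col]

    # Largest values first; mergesort is stable, so ties keep their order.
    top = (df.sort_values(sort_col, ascending=False, kind="mergesort")
             .groupby(group_cols)
             .head(n))

    # Rows outside the top n, summed per group and labelled as "Other".
    rest = df.loc[df.index.difference(top.index)]
    other = (rest.groupby(group_cols, as_index=False)[value_cols]
                 .sum()
                 .assign(**{label_col: other_label}))

    out = pd.concat([top, other], ignore_index=True)
    out = out.sort_values(group_cols, kind="mergesort").reset_index(drop=True)
    # Restore the original column order regardless of pandas version.
    return out[df.columns.tolist()]


# Sample data from the question.
df = pd.DataFrame({
    "Country": ["Europe"] * 6,
    "Region": ["Italy"] * 6,
    "City": ["Milan"] * 6,
    "Product": ["Ring", "Book", "Phone", "Car", "Ring", "Pen"],
    "Sales": [100, 300, 1500, 200, 100, 8200],
    "Val": [10, 5, 10, 5, 10, 5],
})

result = top_n_with_other(df, ["Country", "Region", "City"])
print(result)
```

A group with n or fewer rows produces no "Other" row in this sketch, because nothing is left over after taking the top n.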

Another solution:

print(df)
   Country Region   City Product  Sales  Val
0   Europe  Italy  Milan    Ring    100   10
1   Europe  Italy  Milan    Book    300    5
2   Europe  Italy  Milan   Phone   1500   10
3   Europe  Italy  Milan     Car    200    5
4   Europe  Italy  Milan    Ring    100   10
5   Europe  Italy   Rome     Pen   8200    5
6   Europe  Italy   Rome    Ring    100   10
7   Europe  Italy   Rome    Book    300    5
8   Europe  Italy   Rome   Phone   1500   10
9   Europe  Italy   Rome     Car    200    5
10  Europe  Italy   Rome    Ring    100   10
11  Europe  Italy   Rome  Pencil   8100    5

The idea is to sort the values by Sales, create a per-group counter column with cumcount, and replace the Product values outside the top 3 with Other:

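import numpy as np

cols = ['Country', 'Region', 'City']
df['g'] = df.sort_values('Sales', ascending=False).groupby(cols).cumcount()
# cap the counter at 3 so that every row outside the top 3 shares the same value
df['g'] = df['g'].clip(upper=3)
df['Product'] = np.where(df['g'] >= 3, 'Other', df['Product'])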
print(df)
   Country Region   City Product  Sales  Val  g
0   Europe  Italy  Milan   Other    100   10  3
1   Europe  Italy  Milan    Book    300    5  1
2   Europe  Italy  Milan   Phone   1500   10  0
3   Europe  Italy  Milan     Car    200    5  2
4   Europe  Italy  Milan   Other    100   10  3
5   Europe  Italy   Rome     Pen   8200    5  0
6   Europe  Italy   Rome   Other    100   10  3
7   Europe  Italy   Rome   Other    300    5  3
8   Europe  Italy   Rome   Phone   1500   10  2
9   Europe  Italy   Rome   Other    200    5  3
10  Europe  Italy   Rome   Other    100   10  3
11  Europe  Italy   Rome  Pencil   8100    5  1
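
The assignment to df['g'] above works because cumcount returns a Series that keeps the original index labels, so pandas aligns each counter back to its unsorted row. A tiny sketch with made-up data shows the alignment; the frame `toy` and its columns `k` and `v` are illustrative only and do not come from the answer.

```python
import pandas as pd

# Hypothetical toy frame: one group key and one value column.
toy = pd.DataFrame({"k": ["a", "a", "a", "b", "b"],
                    "v": [10, 30, 20, 5, 7]})

# Number the rows inside each group in descending order of v (0 = largest).
rank = toy.sort_values("v", ascending=False).groupby("k").cumcount()

# rank is indexed by the original row labels, so plain assignment
# puts each counter on the row it belongs to.
toy["g"] = rank
print(toy)
```

In group a the values 30, 20, and 10 receive the counters 0, 1, and 2, and in group b the values 7 and 5 receive 0 and 1, even though the frame itself was never reordered.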

Then aggregate with sum:

df2 = (df.groupby(cols + ['Product'], as_index=False).sum()
         .sort_values(cols + ['g'])
         .drop('g', axis=1)
         .reset_index(drop=True))
print(df2)
  Country Region   City Product  Sales  Val
0  Europe  Italy  Milan   Phone   1500   10
1  Europe  Italy  Milan    Book    300    5
2  Europe  Italy  Milan     Car    200    5
3  Europe  Italy  Milan   Other    200   20
4  Europe  Italy   Rome     Pen   8200    5
5  Europe  Italy   Rome  Pencil   8100    5
6  Europe  Italy   Rome   Phone   1500   10
7  Europe  Italy   Rome   Other    700   30
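
The counter-based approach can also be written as a self-contained function for any N. This is a hedged sketch rather than the answer's exact code: the name `collapse_rest`, its parameters, and the temporary `_rank` column are assumptions made for illustration. It differs from the answer in one respect: the rank is aggregated with min instead of sum, so the ordering still holds if the same product occupies more than one of the top N rows.

```python
import pandas as pd


def collapse_rest(df, group_cols, n=3, sort_col="Sales",
                  label_col="Product", other_label="Other",
                  value_cols=("Sales", "Val")):
    """Relabel rows ranked n or lower as other_label, then sum per label."""
    out = df.copy()
    group_cols = list(group_cols)

    # Per-group rank by descending sort_col; mergesort keeps ties stable.
    rank = (out.sort_values(sort_col, ascending=False, kind="mergesort")
               .groupby(group_cols)
               .cumcount())
    # All rows outside the top n share the rank n.
    out["_rank"] = rank.clip(upper=n)
    out[label_col] = out[label_col].where(out["_rank"] < n, other_label)

    # Sum the value columns; keep the smallest rank for ordering.
    agg = {c: "sum" for c in value_cols}
    agg["_rank"] = "min"
    return (out.groupby(group_cols + [label_col], as_index=False)
               .agg(agg)
               .sort_values(group_cols + ["_rank"])
               .drop(columns="_rank")
               .reset_index(drop=True))


# The 12-row sample from the second solution.
df = pd.DataFrame({
    "Country": ["Europe"] * 12,
    "Region": ["Italy"] * 12,
    "City": ["Milan"] * 5 + ["Rome"] * 7,
    "Product": ["Ring", "Book", "Phone", "Car", "Ring",
                "Pen", "Ring", "Book", "Phone", "Car", "Ring", "Pencil"],
    "Sales": [100, 300, 1500, 200, 100,
              8200, 100, 300, 1500, 200, 100, 8100],
    "Val": [10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5],
})

result = collapse_rest(df, ["Country", "Region", "City"])
print(result)
```

Because the function works on a copy, the caller's DataFrame keeps its original Product values and gains no helper column.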

Regarding "python - Pandas: top N rows plus the total of the rest, for each group", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53500949/
