
python - Pandas: sort within groups, then aggregate

Reposted. Author: 行者123. Updated: 2023-12-05 02:01:45

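I am doing query analysis for a search engine. Within one session, a user may issue several different queries to the Google search engine, one after another, at different times.

My data has several fields: session_id, log_time, query, feature_i, and so on. I want to group by session_id and then, in log_time order, concatenate the rows of each group into a single row, so that the output represents each user's behavior as a time series.

Dataset

Code: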

import pandas as pd

toy_data = pd.DataFrame({'session_id': [1, 2, 1, 2, 3, 3],
                         'log_time': [4, 5, 6, 1, 2, 3],
                         'query': ['hi', 'dude', 'pandas', 'groupby', 'sort', 'agg'],
                         'cate_feat_0': ['apple', 'banana'] * 3,
                         'num_feat_0': [1, 2, 3, 4, 5, 6]})
print(toy_data)

Output:

   session_id  log_time    query cate_feat_0  num_feat_0
0           1         4       hi       apple           1
1           2         5     dude      banana           2
2           1         6   pandas       apple           3
3           2         1  groupby      banana           4
4           3         2     sort       apple           5
5           3         3      agg      banana           6

What I want:

## note that all lists are sorted by log_time within each session_id group
session_id  query_list       log_time_list  cate_feat_0_list  num_feat_0_list
1           [hi, pandas]     [4, 6]         [apple, apple]    [1, 3]
2           [groupby, dude]  [1, 5]         [banana, banana]  [4, 2]
3           [sort, agg]      [2, 3]         [apple, banana]   [5, 6]

My attempt

First, I group and aggregate:

toy_data_res = toy_data.groupby('session_id').agg({'query':list, 'log_time':list, 'cate_feat_0':list, 'num_feat_0':list})
toy_data_res

This gives:

                      query log_time       cate_feat_0 num_feat_0
session_id
1              [hi, pandas]   [4, 6]    [apple, apple]     [1, 3]
2           [dude, groupby]   [5, 1]  [banana, banana]     [2, 4]
3               [sort, agg]   [2, 3]   [apple, banana]     [5, 6]

Then I sort within each session:

import numpy as np

for i in toy_data_res.index:
    # positions that put this session's log_time values in ascending order
    sort_index = np.argsort(toy_data_res.at[i, 'log_time'])
    for col in toy_data_res.columns:
        # reorder the list in this cell; .at writes a single cell, so the list is stored as-is
        toy_data_res.at[i, col] = [toy_data_res.at[i, col][j] for j in sort_index]
toy_data_res

This gives:

                      query log_time       cate_feat_0 num_feat_0
session_id
1              [hi, pandas]   [4, 6]    [apple, apple]     [1, 3]
2           [groupby, dude]   [1, 5]  [banana, banana]     [4, 2]
3               [sort, agg]   [2, 3]   [apple, banana]     [5, 6]

My approach is rather slow. Is there a better way to do groupby -> sort within group -> aggregation?

Hint: We can use STRING_AGG or GROUP_CONCAT in SQL to do within-group sorting.
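For readers comparing with the SQL hint above, a rough pandas analogue of an ordered GROUP_CONCAT is to sort first and then join the strings of each group. This is only a sketch on a reduced copy of the toy data; the ' > ' separator and the name query_concat are illustrative assumptions, not part of the original question.

```python
import pandas as pd

# Reduced copy of the toy data from the question (only the columns needed here).
toy_data = pd.DataFrame({
    'session_id': [1, 2, 1, 2, 3, 3],
    'log_time': [4, 5, 6, 1, 2, 3],
    'query': ['hi', 'dude', 'pandas', 'groupby', 'sort', 'agg'],
})

# Sort by time first. groupby keeps the row order inside each group,
# so each joined string follows log_time order.
query_concat = (
    toy_data.sort_values('log_time')
    .groupby('session_id')['query']
    .agg(' > '.join)  # hypothetical separator, chosen for readability
)
print(query_concat)
```

The result is one string per session_id rather than a list, which is closer to what STRING_AGG or GROUP_CONCAT return.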

Best answer

Call DataFrame.sort_values before groupby. If the same function should be applied to several columns, select them with a list of column names:

df = (toy_data.sort_values(['session_id', 'log_time'])
              .groupby('session_id')[['query', 'log_time', 'cate_feat_0', 'num_feat_0']]
              .agg(list))


print(df)
                      query log_time       cate_feat_0 num_feat_0
session_id
1              [hi, pandas]   [4, 6]    [apple, apple]     [1, 3]
2           [groupby, dude]   [1, 5]  [banana, banana]     [4, 2]
3               [sort, agg]   [2, 3]   [apple, banana]     [5, 6]
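The question asks for column names ending in _list and for session_id as an ordinary column. One way to get that shape, sketched here as an extension of the sort-then-group approach and not as part of the original answer, is to chain add_suffix and reset_index. The names value_cols and result are illustrative.

```python
import pandas as pd

toy_data = pd.DataFrame({
    'session_id': [1, 2, 1, 2, 3, 3],
    'log_time': [4, 5, 6, 1, 2, 3],
    'query': ['hi', 'dude', 'pandas', 'groupby', 'sort', 'agg'],
    'cate_feat_0': ['apple', 'banana'] * 3,
    'num_feat_0': [1, 2, 3, 4, 5, 6],
})

# Columns to collect into lists (illustrative name, in the order the question asks for).
value_cols = ['query', 'log_time', 'cate_feat_0', 'num_feat_0']

result = (
    toy_data.sort_values(['session_id', 'log_time'])  # time order inside each session
    .groupby('session_id')[value_cols]
    .agg(list)              # row order within each group is preserved
    .add_suffix('_list')    # query -> query_list, log_time -> log_time_list, ...
    .reset_index()          # session_id becomes a regular column again
)
print(result)
```

This works because the single sort happens once on the whole frame, so no per-session Python loop is needed.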

Regarding "python - Pandas: sort within groups, then aggregate", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/66346364/
