
python - Pandas: group by, find the maximum within each group, and return the value and its count

Reposted. Author: 太空狗. Updated: 2023-10-29 21:52:41

I have a pandas DataFrame containing log data:

        host service
0   this.com    mail
1   this.com    mail
2   this.com     web
3   that.com    mail
4  other.net    mail
5  other.net     web
6  other.net     web

For each host, I want to find the service with the most errors:

        host service  no
0   this.com    mail   2
1   that.com    mail   1
2  other.net     web   2

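The only solution I have found is to group by host and service and then iterate over level 0 of the index.

Can anyone suggest a better, shorter version without the iteration?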

# Count the rows for each (host, service) pair in a column named 'no'
df = df_logfile.groupby(['host','service']).size().to_frame('no')

df_count = pd.DataFrame()
df_count['host'] = df_logfile['host'].unique()
df_count['service'] = np.nan
df_count['no'] = np.nan

for h, data in df.groupby(level=0):
    # (host, service) index label of the largest count for this host
    i = data.idxmax().iloc[0]
    service = i[1]
    no = data.xs(i).iloc[0]
    df_count.loc[df_count['host'] == h, 'service'] = service
    df_count.loc[(df_count['host'] == h) & (df_count['service'] == service), 'no'] = no

Full code: https://gist.github.com/bjelline/d8066de66e305887b714

Best answer

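Given df, the next step is to group by the host value alone and aggregate with idxmax. This gives you, for each host, the index label that corresponds to the largest service count. You can then use df.loc[...] to select the rows of df that correspond to those largest counts: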

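import numpy as np
import pandas as pd

df_logfile = pd.DataFrame({
    'host': ['this.com', 'this.com', 'this.com', 'that.com', 'other.net',
             'other.net', 'other.net'],
    'service': ['mail', 'mail', 'web', 'mail', 'mail', 'web', 'web']})

# Count the rows for each (host, service) pair in a column named 'no'.
# Named aggregation replaces the dict-renaming form, which newer pandas rejects.
df = df_logfile.groupby(['host', 'service'])['service'].agg(no='count')
# For each host, get the (host, service) index label with the largest count
mask = df.groupby(level=0).agg('idxmax')
# Select those rows, then turn the index levels back into columns
df_count = df.loc[mask['no']]
df_count = df_count.reset_index()
print("\nOutput\n{}".format(df_count))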

which produces the DataFrame

        host service  no
0  other.net     web   2
1   that.com    mail   1
2   this.com    mail   2
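
The same idea can be wrapped in a small helper. The following is only a sketch, not part of the original answer: the function name top_service_per_host and the parameter name logs are made up for illustration, and it assumes the input has 'host' and 'service' columns as in the question. It counts with size() instead of agg, and passes sort=False so that hosts keep their order of first appearance.

```python
import pandas as pd


def top_service_per_host(logs: pd.DataFrame) -> pd.DataFrame:
    """Return, for each host, its most frequent service and the row count.

    Hypothetical helper for illustration; assumes 'host' and 'service' columns.
    """
    # Count rows per (host, service) pair; the Series is named 'no'
    counts = logs.groupby(['host', 'service'], sort=False).size().rename('no')
    # For each host, find the (host, service) label with the highest count
    best = counts.groupby(level='host', sort=False).idxmax()
    # Select those labels and move the index levels back into columns
    return counts.loc[best.tolist()].reset_index()


df_logfile = pd.DataFrame({
    'host': ['this.com', 'this.com', 'this.com', 'that.com', 'other.net',
             'other.net', 'other.net'],
    'service': ['mail', 'mail', 'web', 'mail', 'mail', 'web', 'web']})

df_count = top_service_per_host(df_logfile)
print(df_count)
```

With sort=False the rows come back in the order this.com, that.com, other.net, which matches the layout asked for in the question. If two services tie for a host, idxmax keeps the first one it encounters, so a tie-breaking rule would have to be added separately if that matters.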

Regarding "python - Pandas: group by, find the maximum within each group, and return the value and its count", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/26701849/
