
python - Dask: using groupby to get the row with the maximum value in each group

Reposted · Author: 太空宇宙 · Updated: 2023-11-03 11:20:03

The same problem can be solved in Pandas using transform, as described here. The only working solution I have found with Dask uses a merge. I would like to know whether there is another way to achieve this.
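For reference, a minimal sketch of the transform-based Pandas approach mentioned above, using the same example data as the answer below:

```python
import pandas as pd

df = pd.DataFrame({
    'foo': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'bar': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'cnt': [3, 2, 5, 8, 10, 1, 2, 2, 7],
    'val': ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
})

# transform('max') broadcasts each group's maximum back onto the original rows,
# so a boolean mask keeps exactly the row(s) holding the per-group max cnt
mask = df.groupby(['foo', 'bar'])['cnt'].transform('max') == df['cnt']
result = df[mask]
```

This avoids the merge entirely, but (as with the merge approach) ties on `cnt` within a group will return multiple rows.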

Best answer

First, I would like to rewrite the script referenced in your original question to make sure I understand its intent. As far as I can tell, and as illustrated in my rewrite below, you essentially want a way to extract the row with the highest count (cnt) value for each unique pair of foo and bar. Below is roughly how the referenced script accomplishes this using Pandas alone.

# create an example dataframe
import pandas as pd

df = pd.DataFrame({
    'foo': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'bar': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'cnt': [3, 2, 5, 8, 10, 1, 2, 2, 7],
    'val': ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
})


grouped_df = (df.groupby(['foo', 'bar'])               # groups on a (foo, bar) MultiIndex
                .agg({'cnt': 'max'})                   # returns the max cnt from each group
                .rename(columns={'cnt': 'cnt_max'})    # renames the col to avoid conflicts on merge later
                .reset_index())                        # turns the MultiIndex levels back into columns

merged_df = pd.merge(df, grouped_df, how='left', on=['foo', 'bar'])

# note: I believe a shortcoming here is that if there is more than one match, this would
# return multiple results for some pairings...
final_df = merged_df[merged_df['cnt'] == merged_df['cnt_max']]
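If only one row per pair is wanted, one possible way (my assumption, not part of the original answer) to handle the tie case flagged in the comment above is to drop duplicates on the grouping keys afterwards. A sketch, using a hypothetical tied result:

```python
import pandas as pd

# hypothetical final_df where two rows tie on the max cnt for (MM4, S2)
final_df = pd.DataFrame({
    'foo': ['MM4', 'MM4'],
    'bar': ['S2', 'S2'],
    'cnt': [7, 7],
    'val': ['rd', 'uyi'],
})

# keep only the first matching row for each (foo, bar) pair
deduped = final_df.drop_duplicates(subset=['foo', 'bar'], keep='first')
```

Which tied row survives depends on row order, so this is only appropriate when any one of the tied rows is acceptable.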

Now, here is my take on a Dask-ready version. See the comments for details.

# create an example dataframe
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({
    'foo': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'bar': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'cnt': [3, 2, 5, 8, 10, 1, 2, 2, 7],
    'val': ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
})

# I'm not sure we can rely on val being a column of unique values, so I am just going to
# make a new id column for this; on a very large dataframe that wouldn't fit in memory
# this may not be a reasonable way to create a unique new column, but for the purposes
# of this example it will be sufficient
df['id'] = np.arange(len(df))

# now let's convert this dataframe into a Dask dataframe;
# we use only 1 partition because this is a small sample, and would use more in a real-world case
ddf = dd.from_pandas(df, npartitions=1)

# create a function that takes each grouped sub-dataframe and returns the id of the row
# where cnt is greatest
def select_max(grouped_df):
    # idxmax returns the index label of the max, which is what .loc expects
    # (argmax returns a positional index in modern pandas, which would be wrong here)
    row_with_max_cnt_index = grouped_df['cnt'].idxmax()
    row_with_max_cnt = grouped_df.loc[row_with_max_cnt_index]
    return row_with_max_cnt['id']

# now chain that function into an apply run on the output of the groupby operation
# note: this also may not be the best strategy if the resulting list is too long;
# if that is the case, the output would need to be fed into the next step more carefully
keep_ids = ddf.groupby(['foo', 'bar']).apply(select_max, meta=('id', 'int64')).compute()

# this is pretty straightforward, just get the rows that match the ids from the max cnt applied method
subset_df = ddf[ddf['id'].isin(keep_ids)]
print(subset_df.compute())

Regarding python - Dask: using groupby to get the row with the maximum value in each group, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44855266/
