
python - Clustering each group with groupby+apply - performance issue


I have a DataFrame as follows:

import pandas as pd
import numpy as np

df = pd.DataFrame.from_dict(
{'id': {(1, 0, 'obj11'): '3',
(1, 0, 'obj12'): '9',
(1, 0, 'obj13'): '5',
(1, 0, 'obj14'): '4',
(1, 0, 'obj15'): '23',
(1, 0, 'obj16'): '52',
(1, 0, 'obj17'): '22',
(1, 0, 'obj18'): '13',
(1, 0, 'obj19'): '8',
(1, 0, 'obj20'): '26',
(1, 1000, 'obj11'): '3',
(1, 1000, 'obj12'): '9',
(1, 1000, 'obj13'): '5',
(1, 1000, 'obj14'): '4',
(1, 1000, 'obj15'): '23',
(1, 1000, 'obj16'): '52',
(1, 1000, 'obj17'): '22',
(1, 1000, 'obj18'): '13',
(1, 1000, 'obj19'): '8',
(1, 1000, 'obj20'): '26',
(1, 2000, 'obj11'): '3',
(1, 2000, 'obj12'): '9',
(1, 2000, 'obj13'): '5',
(1, 2000, 'obj14'): '4',
(1, 2000, 'obj15'): '23',
(1, 2000, 'obj16'): '52',
(1, 2000, 'obj17'): '22',
(1, 2000, 'obj18'): '13',
(1, 2000, 'obj19'): '8',
(1, 2000, 'obj20'): '26',
(1, 3000, 'obj11'): '3',
(1, 3000, 'obj12'): '9',
(1, 3000, 'obj13'): '5',
(1, 3000, 'obj14'): '4',
(1, 3000, 'obj15'): '23',
(1, 3000, 'obj16'): '52',
(1, 3000, 'obj17'): '22',
(1, 3000, 'obj18'): '13',
(1, 3000, 'obj19'): '8',
(1, 3000, 'obj20'): '26',
(1, 4000, 'obj11'): '3',
(1, 4000, 'obj12'): '9',
(1, 4000, 'obj13'): '5',
(1, 4000, 'obj14'): '4',
(1, 4000, 'obj15'): '23',
(1, 4000, 'obj16'): '52',
(1, 4000, 'obj17'): '22',
(1, 4000, 'obj18'): '13',
(1, 4000, 'obj19'): '8',
(1, 4000, 'obj20'): '26'},
'var': {(1, 0, 'obj11'): 61.05099868774414,
(1, 0, 'obj12'): 52.6510009765625,
(1, 0, 'obj13'): 61.422000885009766,
(1, 0, 'obj14'): 75.99199676513672,
(1, 0, 'obj15'): 72.30999755859375,
(1, 0, 'obj16'): 63.79999923706055,
(1, 0, 'obj17'): 52.604000091552734,
(1, 0, 'obj18'): 61.02899932861328,
(1, 0, 'obj19'): 65.16999816894531,
(1, 0, 'obj20'): 71.26699829101562,
(1, 1000, 'obj11'): 59.92499923706055,
(1, 1000, 'obj12'): 49.4630012512207,
(1, 1000, 'obj13'): 60.25299835205078,
(1, 1000, 'obj14'): 77.15299987792969,
(1, 1000, 'obj15'): 73.43199920654297,
(1, 1000, 'obj16'): 62.207000732421875,
(1, 1000, 'obj17'): 49.805999755859375,
(1, 1000, 'obj18'): 60.459999084472656,
(1, 1000, 'obj19'): 65.0199966430664,
(1, 1000, 'obj20'): 71.9520034790039,
(1, 2000, 'obj11'): 58.72600173950195,
(1, 2000, 'obj12'): 45.98500061035156,
(1, 2000, 'obj13'): 58.21099853515625,
(1, 2000, 'obj14'): 78.35800170898438,
(1, 2000, 'obj15'): 75.06199645996094,
(1, 2000, 'obj16'): 59.23500061035156,
(1, 2000, 'obj17'): 46.32699966430664,
(1, 2000, 'obj18'): 57.902000427246094,
(1, 2000, 'obj19'): 65.1510009765625,
(1, 2000, 'obj20'): 72.99099731445312,
(1, 3000, 'obj11'): 57.47800064086914,
(1, 3000, 'obj12'): 42.904998779296875,
(1, 3000, 'obj13'): 55.89699935913086,
(1, 3000, 'obj14'): 79.41999816894531,
(1, 3000, 'obj15'): 76.78800201416016,
(1, 3000, 'obj16'): 55.53099822998047,
(1, 3000, 'obj17'): 42.67900085449219,
(1, 3000, 'obj18'): 55.277000427246094,
(1, 3000, 'obj19'): 65.21199798583984,
(1, 3000, 'obj20'): 74.27400207519531,
(1, 4000, 'obj11'): 56.189998626708984,
(1, 4000, 'obj12'): 41.14099884033203,
(1, 4000, 'obj13'): 54.09000015258789,
(1, 4000, 'obj14'): 80.78099822998047,
(1, 4000, 'obj15'): 78.38999938964844,
(1, 4000, 'obj16'): 57.492000579833984,
(1, 4000, 'obj17'): 40.400001525878906,
(1, 4000, 'obj18'): 53.159000396728516,
(1, 4000, 'obj19'): 63.72200012207031,
(1, 4000, 'obj20'): 75.40399932861328}}
)
df.index.names = ['k', 'group', 'object']
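
For reference, the resulting frame has a three-level MultiIndex (k, group, object) and two columns, id and var. The first few rows (output of the snippet above; display widths may differ slightly):

df.head(3)
#                id        var
# k group object
# 1 0     obj11   3  61.050999
#         obj12   9  52.651001
#         obj13   5  61.422001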

I want to cluster the objects within each group, so I defined a function that clusters one group:
from sklearn.cluster import KMeans

def cluster_group(x, index):
    # KMeans expects a 2-D array, so reshape the single 'var' feature to (n_samples, 1)
    clustering = KMeans(n_clusters=3, random_state=42).fit(x.values.reshape(-1, 1))
    return pd.Series(clustering.labels_, index=index)

and applied it to my df as follows:
df \
    .groupby(['k', 'group']) \
    .filter(lambda x: x.shape[0] > 3)['var'] \
    .reset_index('object') \
    .groupby(['k', 'group']) \
    .apply(lambda x: cluster_group(x['var'], x['object']))

However, since the original DataFrame is very large, this solution runs very slowly. Is there a way to optimize its performance?

My machine has an Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz with 6 cores / 12 threads. df.shape is (1286135, 9), and I need to run this computation for many DataFrames, each around this size, so the code needs to be as optimized as possible.

Best Answer

Your original solution isn't bad. I tried to make it a bit faster using the pd.DataFrame.unstack() method.

Possible solution:
df = df.unstack()['var'].apply(lambda x: cluster_group(x, x.index), axis=1)
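
The gain comes from reshaping instead of grouping twice: df.unstack() pivots the innermost index level ('object') into columns, so each (k, group) pair becomes a single row of the wide frame, and the row-wise apply replaces the second groupby. A quick check of the intermediate shape (illustration only):

wide = df.unstack()['var']  # rows: one per (k, group); columns: obj11 .. obj20
wide.shape                  # (5, 10) for the example above

Note that, unlike the original chain, this one-liner does not filter out groups with three or fewer objects; a group that is missing some objects would produce NaN cells in its row, which KMeans cannot handle.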
Speed test with the example above:

%timeit -n 10 test = df \
    .groupby(['k', 'group']) \
    .filter(lambda x: x.shape[0] > 3)['var'] \
    .reset_index('object') \
    .groupby(['k', 'group']) \
    .apply(lambda x: cluster_group(x['var'], x['object']))

234 ms ± 20.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit -n 10 test = df.unstack()['var'].apply(lambda x: cluster_group(x, x.index), axis=1)

205 ms ± 26.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
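
The timings above are close because the example frame is tiny; on frames around (1286135, 9) the per-group KMeans fits dominate. Since each (k, group) slice is clustered independently, the fits are embarrassingly parallel and could also be spread across the 6 cores / 12 threads mentioned in the question. A minimal sketch, assuming joblib and scikit-learn are available (cluster_one is a hypothetical helper name, not part of the original question or answer):

from joblib import Parallel, delayed
from sklearn.cluster import KMeans
import pandas as pd

def cluster_one(key, grp):
    # grp holds the 'var' values of one (k, group) slice
    labels = KMeans(n_clusters=3, random_state=42).fit(
        grp.values.reshape(-1, 1)
    ).labels_
    return key, pd.Series(labels, index=grp.index.get_level_values('object'))

# One KMeans fit per (k, group), run on all available cores;
# skip groups with 3 or fewer rows, matching the original filter
results = Parallel(n_jobs=-1)(
    delayed(cluster_one)(key, grp)
    for key, grp in df['var'].groupby(level=['k', 'group'])
    if len(grp) > 3
)
labels = pd.concat(dict(results), names=['k', 'group', 'object'])

Note that this sketch expects the original long-format df, not the unstacked one (the answer's one-liner reassigns df).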

Regarding python - clustering each group with groupby+apply - performance issue, the original question can be found on Stack Overflow: https://stackoverflow.com/questions/59706979/
