
python - Optimizing a map over a groupby object

Reposted. Author: 行者123. Last updated: 2023-12-02 15:43:15

I have the following DataFrame:

import pandas as pd

test_df = pd.DataFrame({
    'Category': {0: 'product-availability address-confirmation input',
                 1: 'registration register-data-confirmation options',
                 2: 'onboarding return-start input',
                 3: 'registration register-data-confirmation input',
                 4: 'decision-tree first-interaction-validation options'},
    'Original_UserId': {0: '5511949551865@wa.gw.msging.net',
                        1: '5511949551865@wa.gw.msging.net',
                        2: '5511949551865@wa.gw.msging.net',
                        3: '5511949551865@wa.gw.msging.net',
                        4: '5511949551865@wa.gw.msging.net'}})

Thanks to jezrael, I am applying the following map, which follows the logic given in the question "After certain string is found mark every after string as true, pandas":

test_df.groupby('Original_UserId', observed=True)['Category'].apply(lambda s: s.eq('onboarding return-start input').cummax())

It returns the following Series:

pd.Series({0: False, 1: False, 2: True, 3: True, 4: True})
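To make the logic of that lambda easier to follow, here is a minimal sketch of the same `eq(...).cummax()` idea on a single plain Series, without any grouping. The sample values and the names `s`, `hit` and `flag` are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Hypothetical sample values; only the third one matches the target string.
s = pd.Series(['a', 'b', 'onboarding return-start input', 'c', 'd'])

# Step 1: True only on rows that equal the target string.
hit = s.eq('onboarding return-start input')

# Step 2: the cumulative maximum keeps the flag True from the first match onward.
flag = hit.cummax()

print(flag.tolist())
```

The groupby in the question applies these same two steps separately to each user's rows, so the flag starts over for every `Original_UserId`.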

The problem is that when I apply this condition to a larger dataset, the code takes quite a long time to run. Any clues on how to optimize it?

Best answer

If your DataFrame is large, you can use multiprocessing and avoid having to use Dask or PySpark:

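import multiprocessing as mp

import pandas as pd

def f(s):
    # Flag every row from the first match onward within one group
    return s.eq('onboarding return-start input').cummax()

if __name__ == '__main__':
    # test_df = ...
    with mp.Pool(mp.cpu_count()) as pool:
        groups = test_df.groupby('Original_UserId', observed=True)['Category']
        # Each group's Series is sent to a worker process
        data = pool.map(f, [g for _, g in groups])
        s = pd.concat(data)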

Output:

>>> s
0    False
1    False
2     True
3     True
4     True
Name: Category, dtype: bool
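As a possible alternative to spawning processes, the same flag can be computed without a per-group Python lambda by doing the comparison once on the whole column and then calling the built-in grouped `cummax`. This is only a sketch and not part of the answer above: the two-user sample frame and the names `df`, `TARGET`, `hit` and `flag` are hypothetical, and the cast to `int8` and back to `bool` is a cautious assumption so the result does not depend on how a given pandas version handles boolean input in a grouped `cummax`.

```python
import pandas as pd

# Hypothetical two-user sample, for illustration only.
df = pd.DataFrame({
    'Category': ['a', 'onboarding return-start input', 'b',
                 'c', 'onboarding return-start input', 'd'],
    'Original_UserId': ['u1', 'u1', 'u1', 'u2', 'u2', 'u2'],
})

TARGET = 'onboarding return-start input'

# One comparison over the whole column, stored as 0/1 integers.
hit = df['Category'].eq(TARGET).astype('int8')

# Grouped cumulative max: the flag starts over for each user
# and the result keeps the original row order and index.
flag = hit.groupby(df['Original_UserId']).cummax().astype(bool)

print(flag.tolist())
```

Whether this is actually faster than the multiprocessing version depends on the number and size of the groups, so it is worth timing both on the real data.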

For more on python - optimizing a map over a groupby object, see the similar question we found on Stack Overflow: https://stackoverflow.com/questions/75285604/
