
python - Calculate the mean of nested groups if a condition is met

Reposted. Author: 行者123. Updated: 2023-12-04 07:21:27

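I have a dataset ( df ) containing the historical daily ( date ) demand ( demand ) of different supermarkets ( id_store ). Each id_store offers different products ( id_product ), but the assortment changes from day to day, so the same id_product is not offered every day.
My goal is to find, for each id_product in its respective id_store, the mean demand over the last four same weekdays (t-7, t-14, t-21, t-28), provided the id_product was offered in that id_store on more than one of those weekdays. If the id_product was offered on none or only one of those weekdays, NaN should be returned.
Problem:
The following code works, but it takes about 6 hours on my df, which contains 2 million observations. Does anyone have an idea how I can make the code more efficient (e.g. using groupby )?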

import numpy as np
import pandas as pd
from pandas import Timestamp

def mean_weekday_4w(df):

    df.loc[:, "weekday"] = df["date"].dt.day_name()
    df_group = df.groupby(["id_store", "id_product", "weekday"], as_index=False)["demand"].sum()
    query_4w = "date == '%s' | date == '%s' | date == '%s' | date == '%s'"

    for ids, idp, wdy in df_group[["id_store", "id_product", "weekday"]].values:
        df_query_group = df.query("id_store == @ids & id_product == @idp & weekday == @wdy")

        for i, row in df_query_group.iterrows():
            df_query_4w = df_query_group.query(query_4w % (row["date"] - pd.Timedelta(days=7),
                                                           row["date"] - pd.Timedelta(days=14),
                                                           row["date"] - pd.Timedelta(days=21),
                                                           row["date"] - pd.Timedelta(days=28)))

            if df_query_4w.shape[0] >= 2:
                df.at[i, "mean_weekday_4w"] = df_query_4w["demand"].mean()
            else:
                df.at[i, "mean_weekday_4w"] = np.nan

    return df

Here is a small sample of my dataset:
df = pd.DataFrame({'date': {0: Timestamp('2020-03-21 00:00:00'), 1: Timestamp('2020-03-28 00:00:00'), 2: Timestamp('2020-04-04 00:00:00'), 3: Timestamp('2020-04-11 00:00:00'), 4: Timestamp('2020-04-18 00:00:00'), 5: Timestamp('2020-03-21 00:00:00'), 6: Timestamp('2020-03-28 00:00:00'), 7: Timestamp('2020-04-04 00:00:00'), 8: Timestamp('2020-03-21 00:00:00'), 9: Timestamp('2020-03-25 00:00:00'), 10: Timestamp('2020-04-04 00:00:00'), 11: Timestamp('2020-03-21 00:00:00'), 12: Timestamp('2020-03-28 00:00:00'), 13: Timestamp('2020-04-18 00:00:00'), 14: Timestamp('2020-04-25 00:00:00'), 15: Timestamp('2020-05-02 00:00:00'), 16: Timestamp('2020-05-09 00:00:00'), 17: Timestamp('2020-05-16 00:00:00'), 18: Timestamp('2020-03-21 00:00:00')}, 'id_store': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1}, 'id_product': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2, 8: 1, 9: 1, 10: 1, 11: 2, 12: 2, 13: 3, 14: 3, 15: 3, 16: 3, 17: 3, 18: 4}, 'demand': {0: 97.0, 1: 156.0, 2: 130.0, 3: 105.0, 4: 229.0, 5: 60.0, 6: 50.0, 7: 47.0, 8: 69.0, 9: 95.0, 10: 175.0, 11: 66.0, 12: 63.0, 13: 82.0, 14: 7.0, 15: 78.0, 16: 34.0, 17: 36.0, 18: 81.0}})
The desired output looks like this:

    date        id_store  id_product  demand  mean_weekday_4w
0   2020-03-21  0         1            97.00
1   2020-03-28  0         1           156.00
2   2020-04-04  0         1           130.00  126.50
3   2020-04-11  0         1           105.00  127.67
4   2020-04-18  0         1           229.00  122.00
5   2020-03-21  0         2            60.00
6   2020-03-28  0         2            50.00
7   2020-04-04  0         2            47.00   55.00
8   2020-03-21  1         1            69.00
9   2020-03-25  1         1            95.00
10  2020-04-04  1         1           175.00
11  2020-03-21  1         2            66.00
12  2020-03-28  1         2            63.00
13  2020-04-18  1         3            82.00
14  2020-04-25  1         3             7.00
15  2020-05-02  1         3            78.00   44.50
16  2020-05-09  1         3            34.00   55.67
17  2020-05-16  1         3            36.00   50.25
18  2020-03-21  1         4            81.00

Best answer

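Group by weekday as well, then take a rolling window of 4 observations that requires at least 2 observations and excludes the current observation (close only the left side of the interval):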

# select 'demand' explicitly so that only this column is averaged
df['mean_weekday_4w'] = df.groupby(['id_store', 'id_product', df['date'].dt.weekday])['demand'] \
                          .rolling(4, 2, closed='left') \
                          .mean().values
>>> df
         date  id_store  id_product  demand  mean_weekday_4w
0  2020-03-21         0           1    97.0              NaN
1  2020-03-28         0           1   156.0              NaN
2  2020-04-04         0           1   130.0       126.500000
3  2020-04-11         0           1   105.0       127.666667
4  2020-04-18         0           1   229.0       122.000000
5  2020-03-21         0           2    60.0              NaN
6  2020-03-28         0           2    50.0              NaN
7  2020-04-04         0           2    47.0        55.000000
8  2020-03-21         1           1    69.0              NaN
9  2020-03-25         1           1    95.0              NaN
10 2020-04-04         1           1   175.0              NaN
11 2020-03-21         1           2    66.0              NaN
12 2020-03-28         1           2    63.0              NaN
13 2020-04-18         1           3    82.0              NaN
14 2020-04-25         1           3     7.0              NaN
15 2020-05-02         1           3    78.0        44.500000
16 2020-05-09         1           3    34.0        55.666667
17 2020-05-16         1           3    36.0        50.250000
18 2020-03-21         1           4    81.0              NaN
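
One caveat on the rolling approach: an integer window counts the previous four observations in each store/product/weekday group, not the previous four calendar weeks. If a product disappears for several weeks, observations older than t-28 can enter the mean. If the strict t-7 to t-28 rule from the question matters, a possible sketch is to shift every observation forward by each lag and aggregate what lands on the same date. The function name `mean_weekday_4w_exact` and the `lags` and `min_obs` parameters are hypothetical, not part of the original answer, and the sketch has not been benchmarked on the 2 million row dataset.

```python
import pandas as pd


def mean_weekday_4w_exact(df, lags=(7, 14, 21, 28), min_obs=2):
    """Mean demand at exactly t-7, t-14, t-21, t-28 per store and product.

    Hypothetical helper: returns NaN when fewer than `min_obs` of the
    lagged dates were observed.
    """
    keys = ["id_store", "id_product", "date"]

    # Move each observation forward by k days, so that the row dated t-k
    # carries the date t it is a lag of.
    parts = []
    for k in lags:
        part = df[keys + ["demand"]].copy()
        part["date"] = part["date"] + pd.Timedelta(days=k)
        parts.append(part)
    shifted = pd.concat(parts, ignore_index=True)

    # Per (store, product, target date): mean and number of lags found.
    stats = shifted.groupby(keys)["demand"].agg(["mean", "count"])
    stats = stats.loc[stats["count"] >= min_obs, ["mean"]]
    stats = stats.rename(columns={"mean": "mean_weekday_4w"}).reset_index()

    # Left merge keeps every original row; unmatched rows get NaN.
    return df.merge(stats, on=keys, how="left")
```

On the sample data this should reproduce the desired output, since no group there has a gap longer than four weeks. Note that `merge` returns a frame with a fresh default index, so the original index labels are not preserved.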
Performance
N = 105263

# create 1,999,997 records
df1 = pd.concat([df]*N)

# change just the store id to create new groups
df1['id_store'] = pd.concat([df['id_store']+i for i in range(N)]).values
%timeit df1.groupby(['id_store', 'id_product', df1['date'].dt.weekday]).rolling(4, 2, closed='left').mean().values
57.9 s ± 1.58 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
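
When writing the result back to a column, keep in mind that `groupby(...).rolling(...)` returns its values in group order, not in the original row order, so a positional assignment via `.values` only lines up if the frame is already ordered that way. The sample happens to work, but a concatenated frame like the benchmark one is neither in group order nor uniquely indexed. The following is a minimal sketch of one way to map the values back to their rows; the function name `mean_weekday_4w_rolling` is hypothetical and the approach is an assumption on top of the answer, not part of it.

```python
import pandas as pd


def mean_weekday_4w_rolling(df):
    """Rolling mean of the previous 4 same-weekday observations (min. 2).

    Hypothetical helper: works for any row order and for repeated
    index labels by tracking row positions instead of labels.
    """
    # Give every row a unique positional label, then sort chronologically
    # so that each group is rolled in date order.
    out = df.reset_index(drop=True).sort_values("date", kind="stable")
    weekday = out["date"].dt.weekday.rename("weekday")

    rolled = (
        out.groupby(["id_store", "id_product", weekday])["demand"]
        .rolling(4, min_periods=2, closed="left")
        .mean()
        .droplevel([0, 1, 2])  # drop the group keys, keep the row position
    )

    result = df.copy()
    # Sorting by row position restores the original row order.
    result["mean_weekday_4w"] = rolled.sort_index().to_numpy()
    return result
```

Called as `df1 = mean_weekday_4w_rolling(df1)`, this keeps the original index of the input and only adds the new column.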

Regarding python - Calculate the mean of nested groups if a condition is met, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/68470617/
