
Python Pandas : Resampling Multivariate Time Series with a Groupby

Reposted. Author: 太空宇宙. Last updated: 2023-11-03 14:40:07

I have data in the following general format that I would like to resample into 30-day time series windows:

'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25
3,2017-07-15,thing3,55,17
3,2016-05-12,thing3,55,47
4,2012-02-23,thing2,150,22
4,2009-10-10,thing1,25,12
4,2014-04-04,thing2,150,2
5,2008-07-09,thing2,150,43

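I would like the 30-day windows to start on 2014-01-01 and end on 2018-12-31. It is not guaranteed that every customer will have a record in every window. If a customer has multiple transactions in one window, the result should take the units-weighted average of the price, sum the units, and concatenate the product names, so that there is exactly one record per customer per window.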
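To make the aggregation rule concrete, here is a minimal sketch for customer 1's two January 2004 rows from the sample data, assuming both rows land in the same 30-day window. The variable names are illustrative only.

```python
import numpy as np

# Customer 1's two transactions from the sample data
# (assumption: both fall into the same 30-day window)
prices = np.array([25, 150])
units = np.array([47, 8])

# Weighted average price: sum(price * units) / sum(units)
weighted_price = np.average(prices, weights=units)

# Units are summed and product names are joined with '/'
total_units = units.sum()
products = '/'.join(['thing1', 'thing2'])
```

The weighted price here is (25 * 47 + 150 * 8) / 55, which is about 43.18, rather than the unweighted mean of 87.5.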

What I have so far is this:

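import numpy as np
import pandas as pd

# df holds the data above, with transaction_dt parsed as datetime
# weighted average of price, using the matching rows' units as weights
wa = lambda x: np.average(x, weights=df.loc[x.index, 'units'])
# join product names with '/'
con = lambda x: '/'.join(x)

# pass the functions themselves, not their names as strings;
# the grouping keys (customer_id, transaction_dt) are not aggregated
agg_funcs = {'product': con,
             'price': wa,
             'units': 'sum'}

df_window = df.groupby(['customer_id', pd.Grouper(key='transaction_dt', freq='30D')]).agg(agg_funcs)
df_window_final = df_window.unstack('customer_id', fill_value=0)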

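If anyone knows a better way to approach this (especially with in-place and/or vectorized methods), I would appreciate it. Ideally, I would also like to add the window start and end dates as columns on each row.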

Ideally, the final output would look like this:

'customer_id','transaction_dt','product','price','units','window_start_dt','window_end_dt'
1,2004-01-02,thing1/thing2,(weighted average price),(total units),(window_start_dt),(window_end_dt)
2,2004-01-29,thing2,(weighted average price),(total units),(window_start_dt),(window_end_dt)
3,2017-07-15,thing3,(weighted average price),(total units),(window_start_dt),(window_end_dt)
3,2016-05-12,thing3,(weighted average price),(total units),(window_start_dt),(window_end_dt)
4,2012-02-23,thing2,(weighted average price),(total units),(window_start_dt),(window_end_dt)
4,2009-10-10,thing1,(weighted average price),(total units),(window_start_dt),(window_end_dt)
4,2014-04-04,thing2,(weighted average price),(total units),(window_start_dt),(window_end_dt)
5,2008-07-09,thing2,(weighted average price),(total units),(window_start_dt),(window_end_dt)

Best answer

Edited with a new solution. I think you can convert each transaction_dt to a 30-day Period object and then group by it.

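import numpy as np
import pandas as pd

# consecutive 30-day periods covering the whole date range
p = pd.period_range('2004-1-1', '12-31-2018', freq='30D')
def find_period(v):
    # position of the first period whose end time is after v
    p_idx = np.argmax(v < p.end_time)
    return p[p_idx]
# transaction_dt must already be a datetime column
df['period'] = df['transaction_dt'].apply(find_period)
df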

   customer_id transaction_dt product  price  units      period
0            1     2004-01-02  thing1     25     47  2004-01-01
1            1     2004-01-17  thing2    150      8  2004-01-01
2            2     2004-01-29  thing2    150     25  2004-01-01
3            3     2017-07-15  thing3     55     17  2017-06-21
4            3     2016-05-12  thing3     55     47  2016-04-27
5            4     2012-02-23  thing2    150     22  2012-02-18
6            4     2009-10-10  thing1     25     12  2009-10-01
7            4     2014-04-04  thing2    150      2  2014-03-09
8            5     2008-07-09  thing2    150     43  2008-07-08
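Since the question asks for a vectorized approach, the same bucketing can probably also be done without a row-wise apply by integer-dividing each date's offset from the start date by the window length. This is a sketch of an alternative technique, not the answer's method. The column names window_start_dt and window_end_dt come from the question's desired output, and treating window_end_dt as the inclusive last day of the window is an assumption.

```python
import pandas as pd

# A few dates from the sample data
df = pd.DataFrame({'transaction_dt': pd.to_datetime(
    ['2004-01-02', '2004-01-29', '2008-07-09', '2017-07-15'])})

start = pd.Timestamp('2004-01-01')   # first window start, as in the answer
window = pd.Timedelta(days=30)

# Whole number of 30-day windows elapsed since the start date
n = (df['transaction_dt'] - start) // window

df['window_start_dt'] = start + n * window
# Assumption: the window end is reported as its inclusive last day
df['window_end_dt'] = df['window_start_dt'] + window - pd.Timedelta(days=1)
```

For these four rows the computed window starts match the period column in the table above (2004-01-01, 2004-01-01, 2008-07-08, 2017-06-21).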

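We can now use this data frame to get the concatenated products, the weighted average price, and the sum of units. We then use some of the Period functionality to get the window start and end times.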

def my_funcs(df):
    # df is the sub-frame for one (customer_id, period) group
    data = {}
    data['product'] = '/'.join(df['product'].tolist())
    data['units'] = df.units.sum()
    # price weighted by units
    data['price'] = np.average(df['price'], weights=df['units'])
    data['transaction_dt'] = df['transaction_dt'].iloc[0]
    data['window_start_time'] = df['period'].iloc[0].start_time
    data['window_end_time'] = df['period'].iloc[0].end_time
    return pd.Series(data, index=['transaction_dt', 'product', 'price', 'units',
                                  'window_start_time', 'window_end_time'])

df.groupby(['customer_id', 'period']).apply(my_funcs).reset_index('period', drop=True)

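As a possible alternative to the groupby-apply above, the whole aggregation can be sketched with built-in aggregations only. The weighted average is computed as summed revenue divided by summed units, which avoids calling a Python function once per group for the price. This is a sketch under assumptions, not the answer's method: the revenue helper column is introduced here for illustration, only a subset of the sample rows is used, and window_end_dt is taken to be the inclusive last day of the window.

```python
import pandas as pd

# Subset of the sample data from the question
df = pd.DataFrame({
    'customer_id': [1, 1, 2, 4],
    'transaction_dt': pd.to_datetime(
        ['2004-01-02', '2004-01-17', '2004-01-29', '2012-02-23']),
    'product': ['thing1', 'thing2', 'thing2', 'thing2'],
    'price': [25, 150, 150, 150],
    'units': [47, 8, 25, 22],
})

start = pd.Timestamp('2004-01-01')
window = pd.Timedelta(days=30)

# Start of the 30-day window each transaction falls into
df['window_start_dt'] = start + ((df['transaction_dt'] - start) // window) * window
# Hypothetical helper column: price * units, so the weighted mean is a ratio of sums
df['revenue'] = df['price'] * df['units']

out = (df.groupby(['customer_id', 'window_start_dt'], as_index=False)
         .agg(transaction_dt=('transaction_dt', 'first'),
              product=('product', '/'.join),
              units=('units', 'sum'),
              revenue=('revenue', 'sum')))

out['price'] = out['revenue'] / out['units']
# Assumption: the window end is reported as its inclusive last day
out['window_end_dt'] = out['window_start_dt'] + window - pd.Timedelta(days=1)
out = out.drop(columns='revenue')
```

Each row of out is one customer in one window. Customer 1's two January 2004 transactions collapse into a single row with product thing1/thing2 and 55 units.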

Regarding Python Pandas : Resampling Multivariate Time Series with a Groupby, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46611626/
