
python - Vectorizing a date-range transformation function for a pandas DataFrame

Reposted. Author: 行者123. Last updated: 2023-12-01 03:56:53

This is a question about converting a date range into a numeric value based on the current date.

Input table:

ID  START_DATE  END_DATE    CURRENT_DATE
1   2010-12-08  2011-03-01  2011-04-01
2   2010-12-10  2011-01-12  2011-01-02
3   2010-12-16  2011-03-07  2010-10-10

Output table:

ID  START_DATE  END_DATE    CURRENT_DATE  number_of_days
1   2010-12-08  2011-03-01  2011-04-01    78.148490
2   2010-12-10  2011-01-12  2011-01-02    23.726149
3   2010-12-16  2011-03-07  2010-10-10     0.000000

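Here number_of_days is computed with an exponential decay function: every day in the range gets a decayed weight, and the weights for a row are summed.

We can implement it as follows: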

import math
from datetime import timedelta as td

def transform(start, end, current):
    value = 0
    if current > end:  # current date is later than the end date
        delta = end - start
        for i in range(delta.days + 1):
            diff = current - (start + td(days=i))
            value += math.exp(-0.001 * diff.days)
    elif current > start:  # current date is between the start and end
        delta = current - start
        for i in range(delta.days + 1):
            diff = current - (start + td(days=i))
            value += math.exp(-0.001 * diff.days)
    else:  # current date is on or before the start date: nothing to count
        pass
    return value
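
Written out, the function above appears to compute the following sum. The decay rate \lambda = 0.001 is taken from the code; the symbols a, b and c are introduced here only for illustration and stand for START_DATE, END_DATE and CURRENT_DATE counted in days:

```latex
\[
\text{number\_of\_days} =
\begin{cases}
\displaystyle\sum_{i=0}^{\min(b,\,c) - a} \exp\bigl(-\lambda\,(c - a - i)\bigr), & c > a,\\[1ex]
0, & c \le a,
\end{cases}
\qquad \lambda = 0.001
\]
```

Each day from the start date up to the earlier of the end date and the current date contributes a weight that shrinks with its distance from the current date.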

and then apply the transformation like this:

df['number_of_days'] = df.apply(lambda x: transform(x['START_DATE'], x['END_DATE'], x['CURRENT_DATE']), axis=1)
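
To make the baseline reproducible, here is a minimal self-contained sketch that rebuilds the three sample rows and applies a row-wise function to them. The `transform` below is a condensed rewrite that should behave like the one in the question, not the asker's exact code, and using `ID` as the index is an assumption.

```python
import math
from datetime import timedelta as td

import pandas as pd

# Sample rows copied from the input table in the question.
df = pd.DataFrame(
    {
        "START_DATE": pd.to_datetime(["2010-12-08", "2010-12-10", "2010-12-16"]),
        "END_DATE": pd.to_datetime(["2011-03-01", "2011-01-12", "2011-03-07"]),
        "CURRENT_DATE": pd.to_datetime(["2011-04-01", "2011-01-02", "2010-10-10"]),
    },
    index=pd.Index([1, 2, 3], name="ID"),
)


def transform(start, end, current):
    """Row-wise baseline: sum exp(-0.001 * age in days) over the covered days."""
    if current <= start:
        return 0.0  # current date is not after the start date: nothing to count
    last = min(end, current)  # stop at whichever comes first
    n_days = (last - start).days + 1
    return sum(
        math.exp(-0.001 * (current - (start + td(days=i))).days)
        for i in range(n_days)
    )


df["number_of_days"] = df.apply(
    lambda row: transform(row["START_DATE"], row["END_DATE"], row["CURRENT_DATE"]),
    axis=1,
)
print(df)
```

The inner loop runs once per day per row, which is where the cost comes from on large tables.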

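However, this is very slow for tables with millions of rows and large date ranges.

Any ideas on how to speed this up by vectorizing the inner for loop of the transform function?

Thanks!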

Best answer

You can use numpy arrays to vectorize the function that computes the exponential decay.

df = df[df.CURRENT_DATE > df.START_DATE] # just focusing on cases with calculation

Get the relevant delta, which depends on whether CURRENT_DATE is later than END_DATE:

delta = df[['END_DATE', 'CURRENT_DATE']].min(axis=1).subtract(df.START_DATE).dt.days.add(1)

Compute the shift for the arange() used in the exponential decay as the maximum of 0 and the difference between CURRENT_DATE and END_DATE:

shift = df.CURRENT_DATE.subtract(df.END_DATE).dt.days.clip(lower=0)

Apply np.exp() and np.sum() to the (shifted) arange objects:

df['number_of_days'] = [np.sum(np.exp(-0.001 * (np.arange(d) + s))) for d, s in zip(delta.values, shift.values)]

to get:

    START_DATE    END_DATE CURRENT_DATE  number_of_days
ID
1   2010-12-08  2011-03-01   2011-04-01       78.148490
2   2010-12-10  2011-01-12   2011-01-02       23.726149
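
Note that the filter in the first step drops row 3, whereas the output table in the question keeps it with a value of 0. If every row has to stay, one option is to compute on a boolean mask and leave the remaining rows at 0. This is a sketch under that assumption rather than part of the original answer; `mask` and `active` are illustrative names, and the sample frame is rebuilt from the question's input table.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the input table in the question.
df = pd.DataFrame(
    {
        "START_DATE": pd.to_datetime(["2010-12-08", "2010-12-10", "2010-12-16"]),
        "END_DATE": pd.to_datetime(["2011-03-01", "2011-01-12", "2011-03-07"]),
        "CURRENT_DATE": pd.to_datetime(["2011-04-01", "2011-01-02", "2010-10-10"]),
    },
    index=pd.Index([1, 2, 3], name="ID"),
)

# Rows that actually need a calculation; all others keep the default of 0.
mask = df.CURRENT_DATE > df.START_DATE
active = df[mask]

# Number of contributing days and offset of the newest one, as in the answer.
delta = (
    active[["END_DATE", "CURRENT_DATE"]].min(axis=1)
    .subtract(active.START_DATE).dt.days.add(1)
)
shift = active.CURRENT_DATE.subtract(active.END_DATE).dt.days.clip(lower=0)

df["number_of_days"] = 0.0
df.loc[mask, "number_of_days"] = [
    np.sum(np.exp(-0.001 * (np.arange(d) + s)))
    for d, s in zip(delta.values, shift.values)
]
print(df)
```

This keeps the original index and row count, so the result can be joined back to other columns without re-aligning.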

Comparing performance shows the efficiency gained by removing the inner loop:

df_test = pd.concat([df for _ in range(100000)])

def transform1(df):
    df = df[df.CURRENT_DATE > df.START_DATE]
    delta = df[['END_DATE', 'CURRENT_DATE']].min(axis=1).subtract(df.START_DATE).dt.days.add(1)
    shift = df.CURRENT_DATE.subtract(df.END_DATE).dt.days.clip(lower=0)
    return [np.sum(np.exp(-0.001 * (np.arange(d) + s))) for d, s in zip(delta.values, shift.values)]

%timeit transform1(df_test)
1 loop, best of 3: 4.99 s per loop

def transform2(df):
    df['end'] = [d.days for d in df.CURRENT_DATE - df.START_DATE]
    df['start'] = (df.end - [max(0, d.days + 1) for d in (df.END_DATE.where(df.CURRENT_DATE > df.END_DATE, df.CURRENT_DATE) - df.START_DATE)])
    df['number_of_days'] = [sum(np.exp(-0.001 * i) for i in np.arange(stop, start, -1)) for start, stop in zip(df.start, df.end)]
    df.drop(['start', 'end'], axis=1, inplace=True)

%timeit transform2(df_test)
1 loop, best of 3: 36.7 s per loop
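
The list comprehension in transform1 still runs one Python-level iteration per row. Because the weights form a geometric series, sum_{k=0}^{d-1} exp(-lam * (k + s)) = exp(-lam * s) * (1 - exp(-lam * d)) / (1 - exp(-lam)), the per-row sum can probably be replaced by a closed form evaluated on whole columns. The sketch below goes beyond the original answer: `transform_closed_form` and `lam` are hypothetical names, and it has not been timed here.

```python
import numpy as np
import pandas as pd


def transform_closed_form(df, lam=0.001):
    """Closed-form geometric sum; returns one value per row, 0 where no day counts."""
    last = df[["END_DATE", "CURRENT_DATE"]].min(axis=1)
    # Number of contributing days, clipped so out-of-range rows yield 0.
    d = (last - df.START_DATE).dt.days.add(1).clip(lower=0)
    # Offset in days between CURRENT_DATE and the newest contributing day.
    s = (df.CURRENT_DATE - df.END_DATE).dt.days.clip(lower=0)
    total = np.exp(-lam * s) * (1 - np.exp(-lam * d)) / (1 - np.exp(-lam))
    # Match the question's function: nothing counts unless CURRENT_DATE > START_DATE.
    return total.where(df.CURRENT_DATE > df.START_DATE, 0.0)


# Sample rows copied from the input table in the question.
sample = pd.DataFrame(
    {
        "START_DATE": pd.to_datetime(["2010-12-08", "2010-12-10", "2010-12-16"]),
        "END_DATE": pd.to_datetime(["2011-03-01", "2011-01-12", "2011-03-07"]),
        "CURRENT_DATE": pd.to_datetime(["2011-04-01", "2011-01-02", "2010-10-10"]),
    },
    index=pd.Index([1, 2, 3], name="ID"),
)
result = transform_closed_form(sample)
print(result)
```

With this form the work per row no longer depends on the length of the date range, and no rows need to be filtered out beforehand.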

Regarding python - vectorizing a date-range transformation function for a pandas DataFrame, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37301529/
