
python - A faster (vectorized) way to do this pandas formula with dates

Reposted. Author: 行者123. Updated: 2023-12-04 02:26:43

I am building a time series and am looking for a more efficient way to do it, ideally vectorized. The pandas apply with a list comprehension step is very slow on large datasets.

import datetime
import pandas as pd

# Dummy data:
todays_date = datetime.datetime.now().date()
xdates = pd.date_range(todays_date-datetime.timedelta(10), periods=4, freq='D')
categories = list(2*'A') + list(2*'B')
d = {'xdate': xdates, 'periods': [8]*2 + [2]*2, 'interval': [3]*2 + [12]*2}
df = pd.DataFrame(d,index=categories)

# This step is slow:
df['sdates'] = df.apply(lambda x: [x.xdate + pd.DateOffset(months=k*x.interval) for k in range(x.periods)], axis=1)
# This step is quite quick, but shown here for completeness
df = df.explode('sdates')

Perhaps something like this:

df['sdates'] = [df.xdate + df.periods * [df.interval.astype('timedelta64[M]')]]

but the syntax isn't quite right. This code

df = pd.DataFrame(d,index=categories)
df['m_offsets'] = df.interval.apply(lambda x: list(range(0, 72, x)))
df = df.explode('m_offsets')
df['sdate'] = df.xdate + df.m_offsets * pd.DateOffset(months=1)

is, I think, close to one of the answers, but in the last step pd.DateOffset raises a warning:

PerformanceWarning: Adding/subtracting array of DateOffsets to DatetimeArray not vectorized

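I tried to build something along the lines of one of the answers, but as mentioned there, the modular arithmetic needs quite a bit of tweaking to handle edge cases, and I haven't figured it out yet (calendar.monthrange doesn't play nicely with columns). This function does not run: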

from calendar import monthrange
def add_months(df, date_col, n_col):
    """ Adds n_col months to date_col """
    z = df.copy()
    # calculate new year/month/day and convert to datetime
    z['year'] = (z[date_col].dt.year * 12 + (z[date_col].dt.month-1) + z[n_col]) // 12
    z['month'] = ((z[date_col].dt.month + z[n_col] - 1) % 12) + 1
    x,x = monthrange(z.year, z.month)
    z['days_in_month'] = monthrange(z.year, z.month)
    z['target_day'] = z[date_col].dt.day
    # z['day'] = min(z.target_day, z.days_in_month)
    z['day'] = z.days_in_month
    z['sdates'] = pd.to_datetime(z[['year', 'month', 'day']])
    return z['sdates']

For now the following works, but the DateOffset is a very heavy step.

df = pd.DataFrame(d,index=categories)
df['m_offsets'] = df.interval.apply(lambda x: list(range(0, 72, x)))
df = df.explode('m_offsets')
df['sdates'] = df.apply(lambda x: x.xdate + pd.DateOffset(months=x.m_offsets), axis=1)

Best answer

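Here is one option. You are adding months, so we can compute the new year/month/day in a vectorized way using only integer arithmetic, and then build datetimes from those y/m/d combinations: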

import numpy as np
import pandas as pd

def f_proposed(df):
    z = df.copy()
    z = z.reset_index()

    # repeat xdate as many times as the number of periods
    z = z.loc[np.repeat(z.index, z['periods'])]

    # calculate k number of months to add
    z['k'] = z.groupby(level=0).cumcount() * z['interval']

    # calculate new year/month/day and convert to datetime
    z['year'] = (z['xdate'].dt.year * 12 + z['xdate'].dt.month - 1 + z['k']) // 12
    z['month'] = (z['xdate'].dt.month - 1 + z['k']) % 12 + 1

    # clip day to days_in_month
    z['days_in_month'] = pd.to_datetime(
        z['year'].astype(str)+'-'+z['month'].astype(str)+'-01').dt.days_in_month
    z['day'] = np.clip(z['xdate'].dt.day, 0, z['days_in_month'])

    z['sdates'] = pd.to_datetime(z[['year', 'month', 'day']])

    # drop temporary columns
    z = z.set_index('index').drop(columns=['k', 'year', 'month', 'day', 'days_in_month'])
    return z
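
To make the integer arithmetic easier to check in isolation, here is a minimal sketch of just the "add months, then clip the day" step, compared against pd.DateOffset on a few edge cases (end of month, year rollover, leap day). The helper name add_months_vectorized and the sample dates are my own illustrative choices, not part of the function above.

```python
import numpy as np
import pandas as pd

def add_months_vectorized(dates: pd.Series, months: pd.Series) -> pd.Series:
    """Hypothetical helper: add a per-row number of months using integers only."""
    # Count months since year 0 so that year rollover is plain integer division.
    total = dates.dt.year * 12 + dates.dt.month - 1 + months
    year = total // 12
    month = total % 12 + 1
    # First day of each target month, used only to look up the month length.
    first = pd.to_datetime(pd.DataFrame({"year": year, "month": month, "day": 1}))
    # Clip the original day so that e.g. Jan 31 + 1 month lands on the last day of Feb.
    day = np.minimum(dates.dt.day, first.dt.days_in_month)
    return pd.to_datetime(pd.DataFrame({"year": year, "month": month, "day": day}))

# Illustrative inputs: month-end, year rollover, and a leap day.
dates = pd.Series(pd.to_datetime(["2021-01-31", "2021-11-15", "2020-02-29"]))
months = pd.Series([1, 3, 12])

result = add_months_vectorized(dates, months)
# Row-by-row reference using the slow DateOffset path.
expected = pd.Series([d + pd.DateOffset(months=int(k)) for d, k in zip(dates, months)])
print(pd.DataFrame({"result": result, "expected": expected}))
```

The two columns should agree row by row, since pd.DateOffset(months=...) also clips to the last valid day of the target month.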

To compare performance against the original approach, I generated a test dataset with 10,000 rows.
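
Neither the data generator nor f_original is shown here, so the following is only a sketch of how such a benchmark might be set up. It assumes f_original simply wraps the apply-plus-explode step from the question; make_test_data, its value ranges, and the random seed are hypothetical.

```python
import numpy as np
import pandas as pd

def make_test_data(n_rows: int = 10_000, seed: int = 0) -> pd.DataFrame:
    """Hypothetical generator: random start dates, period counts and month intervals."""
    rng = np.random.default_rng(seed)
    day_offsets = pd.to_timedelta(rng.integers(0, 365, n_rows), unit="D")
    return pd.DataFrame({
        "xdate": pd.Timestamp("2021-01-01") + day_offsets,
        "periods": rng.integers(1, 9, n_rows),           # 1 to 8 dates per row
        "interval": rng.choice([1, 3, 6, 12], n_rows),   # months between dates
    })

def f_original(df: pd.DataFrame) -> pd.DataFrame:
    """Assumed baseline: the row-wise apply + explode step from the question."""
    z = df.copy()
    z["sdates"] = z.apply(
        lambda x: [x.xdate + pd.DateOffset(months=k * x.interval) for k in range(x.periods)],
        axis=1,
    )
    return z.explode("sdates")

z = make_test_data(1_000)   # kept small here so the sketch runs quickly
out = f_original(z)
print(len(z), "input rows ->", len(out), "output rows")
```

With z built this way, both functions can be timed on the same frame, for example with %timeit in IPython.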

Here are my timings (about a 23x speedup at 10K rows):

%timeit f_proposed(z)
82.7 ms ± 222 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit f_original(z)
1.92 s ± 2.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

P.S. For 170K rows, f_proposed takes about 1.39 s on my machine, versus 33.6 s for f_original.

Regarding python - a faster (vectorized) way to do this pandas formula with dates, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/67093638/
