
python - Vectorizing an extremely slow groupBy


I have a DataFrame that I have mostly vectorized, but a few columns still force a loop over a groupBy. For small datasets the speed is acceptable, but for anything beyond 50k+ rows it becomes extremely slow.

The basic idea: when the unique column has a value (np.isfinite), wait a few days (4 in the example) and then set complete to True. Repeat. Positive hits inside the 4-period (day) window should be ignored; e.g. a hit on 2019-11-01 means hits on 2019-11-02 through 2019-11-04 are skipped and complete is set on 2019-11-05.

This is what I have now. It works correctly, but it is very slow, and I am very interested in how to vectorize it.

import numpy as np
import pandas as pd
from datetime import datetime, timedelta

# Two names ('ALFA', 'BETA'), 30 daily rows each.
times = np.arange(datetime(2019, 11, 1), datetime(2019, 12, 1), timedelta(days=1)).astype(datetime)
times = np.concatenate([times, times])
names = np.array(['ALFA'] * 30 + ['BETA'] * 30)

# Sparse signal: keep only draws >= 0.7, everything else becomes NaN.
unique = np.random.randn(60)
unique[unique < 0.7] = np.nan

df = pd.DataFrame({'unique': unique, 'complete': np.nan}, index=[names, times])
df.index = df.index.set_names(['Name', 'Date'])

df['num'] = df.groupby('Name').cumcount()
entryNum, posit = len(df.index) + 1, 0

for n, group in df.groupby(level=['Name']):
    posit = 0
    for date, col in group.groupby(level=['Date']):
        # Four rows after the last entry, mark that day as complete.
        if col.num[0] - entryNum == 4:
            posit = 0
            df.loc[(n, date), 'complete'] = True
        # Open a new entry on the first finite value while flat.
        if not posit and np.isfinite(col.unique[0]):
            posit = 1
            entryNum = col.num[0]
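
To reproduce the slowdown at the scale mentioned above, the loop can be wrapped in a function and timed on a larger frame built the same way. A rough sketch; the run_loop helper, the 1,000-name size, and the timing harness are illustrative additions, not part of the original post:

import time

def run_loop(df):
    # The same nested-groupby logic as above, wrapped so it can be timed.
    entryNum, posit = len(df.index) + 1, 0
    for n, group in df.groupby(level=['Name']):
        posit = 0
        for date, col in group.groupby(level=['Date']):
            if col.num[0] - entryNum == 4:
                posit = 0
                df.loc[(n, date), 'complete'] = True
            if not posit and np.isfinite(col.unique[0]):
                posit = 1
                entryNum = col.num[0]

# 1,000 names x 30 days = 30,000 rows, built like the toy frame above.
big_names = np.repeat(['N%04d' % i for i in range(1000)], 30)
big_unique = np.random.randn(30000)
big_unique[big_unique < 0.7] = np.nan
big = pd.DataFrame({'unique': big_unique, 'complete': np.nan},
                   index=[big_names, np.tile(times[:30], 1000)])
big.index = big.index.set_names(['Name', 'Date'])
big['num'] = big.groupby('Name').cumcount()

start = time.perf_counter()
run_loop(big)
print('loop: %.1fs' % (time.perf_counter() - start))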

RafaelC's solution is ingenious, but in some cases it diverges from mine. A test set for the unique column:
unique = [0.808154, np.nan, np.nan, 0.976455, np.nan, 1.81917, np.nan, 0.732306, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 0.878656, np.nan, 1.087899, 1.57941, 1.211292, np.nan, 1.431411, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 1.323002, 1.339211, np.nan, np.nan, 1.322755, np.nan, 0.960014, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 1.833514, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 2.3884, np.nan, np.nan, 1.372292, np.nan, np.nan]
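
To reproduce the table below, this list can simply replace the random column before both approaches are re-run (a trivial substitution, assuming the df from the question):

# Overwrite the random draws with the fixed test values (60 rows, as above).
df['unique'] = unique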

Output (complete comes from my loop; note rows such as ALFA 2019-11-10 and BETA 2019-11-09, where complete is True but solution is False):
                   unique complete  countnonnull  solution
Name Date
ALFA 2019-11-01  0.808154      NaN           1.0     False
     2019-11-02       NaN      NaN           1.0     False
     2019-11-03       NaN      NaN           1.0     False
     2019-11-04  0.976455      NaN           2.0     False
     2019-11-05       NaN     True           1.0      True
     2019-11-06  1.819170      NaN           2.0     False
     2019-11-07       NaN      NaN           2.0     False
     2019-11-08  0.732306      NaN           2.0     False
     2019-11-09       NaN      NaN           2.0     False
     2019-11-10       NaN     True           1.0     False
     2019-11-11       NaN      NaN           1.0     False
     2019-11-12       NaN      NaN           0.0     False
     2019-11-13       NaN      NaN           0.0     False
     2019-11-14       NaN      NaN           0.0     False
     2019-11-15       NaN      NaN           0.0     False
     2019-11-16       NaN      NaN           0.0     False
     2019-11-17       NaN      NaN           0.0     False
     2019-11-18  0.878656      NaN           1.0     False
     2019-11-19       NaN      NaN           1.0     False
     2019-11-20  1.087899      NaN           2.0     False
     2019-11-21  1.579410      NaN           3.0     False
     2019-11-22  1.211292     True           3.0      True
     2019-11-23       NaN      NaN           3.0     False
     2019-11-24  1.431411      NaN           3.0     False
     2019-11-25       NaN      NaN           2.0     False
     2019-11-26       NaN     True           1.0     False
     2019-11-27       NaN      NaN           1.0     False
     2019-11-28       NaN      NaN           0.0     False
     2019-11-29       NaN      NaN           0.0     False
     2019-11-30       NaN      NaN           0.0     False
BETA 2019-11-01  1.323002      NaN           1.0     False
     2019-11-02  1.339211      NaN           2.0     False
     2019-11-03       NaN      NaN           2.0     False
     2019-11-04       NaN      NaN           2.0     False
     2019-11-05  1.322755     True           2.0      True
     2019-11-06       NaN      NaN           1.0     False
     2019-11-07  0.960014      NaN           2.0     False
     2019-11-08       NaN      NaN           2.0     False
     2019-11-09       NaN     True           1.0     False
     2019-11-10       NaN      NaN           1.0     False
     2019-11-11       NaN      NaN           0.0     False
     2019-11-12       NaN      NaN           0.0     False
     2019-11-13       NaN      NaN           0.0     False
     2019-11-14  1.833514      NaN           1.0     False
     2019-11-15       NaN      NaN           1.0     False
     2019-11-16       NaN      NaN           1.0     False
     2019-11-17       NaN      NaN           1.0     False
     2019-11-18       NaN     True           0.0      True
     2019-11-19       NaN      NaN           0.0     False
     2019-11-20       NaN      NaN           0.0     False
     2019-11-21       NaN      NaN           0.0     False
     2019-11-22       NaN      NaN           0.0     False
     2019-11-23       NaN      NaN           0.0     False
     2019-11-24       NaN      NaN           0.0     False
     2019-11-25  2.388400      NaN           1.0     False
     2019-11-26       NaN      NaN           1.0     False
     2019-11-27       NaN      NaN           1.0     False
     2019-11-28  1.372292      NaN           2.0     False
     2019-11-29       NaN     True           1.0      True
     2019-11-30       NaN      NaN           NaN     False

Best Answer

Here is my approach, using only a single groupby:

def update(v, thresh=4):
    # Drop any True that falls within `thresh` rows of the previous kept True.
    ret = v.copy()
    count = thresh + 1  # start "armed" so the first True is always kept
    for i in ret.index:
        count += 1
        if ret.loc[i]:
            if count >= thresh:
                count = 0
            else:
                ret.loc[i] = np.nan
    return ret

groups = df.groupby('Name')
# A row is a completion candidate if 'unique' was non-null 4 rows earlier.
df['f_complete'] = groups['unique'].shift(4).notnull()
# Within each name, suppress candidates that come too soon after a kept one.
df['f_complete'] = groups['f_complete'].apply(update)
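
To see where this differs from the loop's result, the two columns can be compared directly. A small check, assuming df still carries the loop-filled complete column (NaN is treated as False on both sides):

# Normalize both columns to plain booleans and show any disagreements.
expected = df['complete'].fillna(False).astype(bool)
actual = df['f_complete'].fillna(False).astype(bool)
print(df[expected != actual])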

Regarding python - Vectorizing an extremely slow groupBy, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59215159/
