
python - Return a 2-second cumulative sum using a dataset of ranges

Reposted. Author: 行者123. Updated: 2023-11-30 21:53:13

I'm fairly new to Python and data science.

I have these two dataframes. The df dataframe:

df = pd.DataFrame({"Date": ['2014-11-21 11:00:00', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:07', '2014-11-21 11:00:08', '2014-11-21 11:00:10', '2014-11-21 11:00:11', '2014-10-24 10:00:55', '2014-10-24 10:00:59'], "A":[1, 2, 5, 3, 9, 6, 3, 0, 8, 10]})

                  Date   A
0  2014-11-21 11:00:00   1
1  2014-11-21 11:00:03   2
2  2014-11-21 11:00:04   5
3  2014-11-21 11:00:05   3
4  2014-11-21 11:00:07   9
5  2014-11-21 11:00:08   6
6  2014-11-21 11:00:10   3
7  2014-11-21 11:00:11   0
8  2014-10-24 10:00:55   8
9  2014-10-24 10:00:59  10

The info dataframe, which contains the datetime ranges that my final df should cover:

info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:08', '2014-10-24 10:00:55'], "Stop": ['2014-11-21 11:00:07', '2014-11-21 11:00:11', '2014-10-24 10:00:59']})

                 Start                 Stop
0  2014-11-21 11:00:00  2014-11-21 11:00:07
1  2014-11-21 11:00:08  2014-11-21 11:00:11
2  2014-10-24 10:00:55  2014-10-24 10:00:59

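The goal is to compute a cumulative sum over df using a two-second window, if and only if the row in df falls within the range of one of the rows in info. For example, the row dated 2014-11-21 11:00:08 should have a cumulative sum of 0, because it sits at the beginning of a range. Another example is the row dated 2014-11-21 11:00:07, whose sum should be 12 (9+3).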
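As a rough illustration of the window arithmetic described above, here is a minimal sketch that sums the values in the closed window [t - 2s, t] by hand for the first range. The helper name `window_sum` and the variable names are hypothetical, chosen only for this example.

```python
import pandas as pd

# Rows of df that fall inside the first range (11:00:00 to 11:00:07)
dates = pd.to_datetime([
    "2014-11-21 11:00:00", "2014-11-21 11:00:03", "2014-11-21 11:00:04",
    "2014-11-21 11:00:05", "2014-11-21 11:00:07",
])
values = pd.Series([1, 2, 5, 3, 9], index=dates)


def window_sum(series, t, width="2s"):
    """Sum the values whose timestamp lies in the closed window [t - width, t]."""
    t = pd.Timestamp(t)
    mask = (series.index >= t - pd.Timedelta(width)) & (series.index <= t)
    return series[mask].sum()


# 11:00:07 looks back to 11:00:05, so it picks up the values 3 and 9
total_0707 = window_sum(values, "2014-11-21 11:00:07")
# 11:00:04 looks back to 11:00:02, so it picks up the values 2 and 5
total_0704 = window_sum(values, "2014-11-21 11:00:04")
print(total_0707, total_0704)  # 12 7
```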

This is what I have so far:

import pandas as pd
import numpy as np

df = pd.DataFrame({"Date": ['2014-11-21 11:00:00', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:07', '2014-11-21 11:00:08', '2014-11-21 11:00:10', '2014-11-21 11:00:11', '2014-10-24 10:00:55', '2014-10-24 10:00:59'], "A":[1, 2, 5, 3, 9, 6, 3, 0, 8, 10]})
info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:08', '2014-10-24 10:00:55'], "Stop": ['2014-11-21 11:00:07', '2014-11-21 11:00:11', '2014-10-24 10:00:59']})
#info = pd.DataFrame({"Start": ['2014-11-21 11:00:00', '2014-11-21 11:00:00', '2014-11-21 11:00:00', '2014-11-21 11:00:01', '2014-11-21 11:00:02', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05'], "Stop": ['2014-11-21 11:00:00', '2014-11-21 11:00:01', '2014-11-21 11:00:02', '2014-11-21 11:00:03', '2014-11-21 11:00:04', '2014-11-21 11:00:05', '2014-11-21 11:00:06', '2014-11-21 11:00:07']})
info['groupnum']=info.index
info.Start=pd.to_datetime(info.Start)
info.Stop=pd.to_datetime(info.Stop)
cinfo = info.set_index(pd.IntervalIndex.from_arrays(info.Start, info.Stop, closed='both'))['groupnum']
df['groupnum']=pd.to_datetime(df.Date).map(cinfo)
df['cum'] = df.groupby('groupnum').A.cumsum()
print(df)

Expected result:

                  Date   A  groupnum  cum
0  2014-11-21 11:00:00   1         0    1
1  2014-11-21 11:00:03   2         0    2
2  2014-11-21 11:00:04   5         0    7
3  2014-11-21 11:00:05   3         0   10
4  2014-11-21 11:00:07   9         0   12
5  2014-11-21 11:00:08   6         1    6
6  2014-11-21 11:00:10   3         1    9
7  2014-11-21 11:00:11   0         1    3
8  2014-10-24 10:00:55   8         2    8
9  2014-10-24 10:00:59  10         2   10

Actual result:

                  Date   A  groupnum  cum
0  2014-11-21 11:00:00   1         0    1
1  2014-11-21 11:00:03   2         0    3
2  2014-11-21 11:00:04   5         0    8
3  2014-11-21 11:00:05   3         0   11
4  2014-11-21 11:00:07   9         0   20
5  2014-11-21 11:00:08   6         1    6
6  2014-11-21 11:00:10   3         1    9
7  2014-11-21 11:00:11   0         1    9
8  2014-10-24 10:00:55   8         2    8
9  2014-10-24 10:00:59  10         2   18

But this computes the cumulative sum over each whole groupnum group, and I can't find a way to accumulate over only 2 seconds.
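To make the difference concrete, the following sketch compares a plain running total with a time-based rolling sum on the rows of the first range only. It assumes the dates are used as a DatetimeIndex; the variable names are illustrative.

```python
import pandas as pd

# Rows of the first range, indexed by their timestamps
s = pd.Series(
    [1, 2, 5, 3, 9],
    index=pd.to_datetime([
        "2014-11-21 11:00:00", "2014-11-21 11:00:03", "2014-11-21 11:00:04",
        "2014-11-21 11:00:05", "2014-11-21 11:00:07",
    ]),
)

# cumsum accumulates everything since the start of the group
running = s.cumsum()

# a time-based rolling window only looks back 2 seconds;
# closed="both" includes both endpoints of the window
windowed = s.rolling("2s", closed="both").sum()

print(running.tolist())   # [1, 3, 8, 11, 20]
print(windowed.tolist())  # [1.0, 2.0, 7.0, 10.0, 12.0]
```

The first list matches the "actual result" for group 0, and the second matches the "expected result".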

Is there a proper way to achieve this? Any help would be appreciated.

My English is not very good; I hope I explained this clearly.

Best answer

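This approach may not work on a dataframe with 100M rows, since the outer comparison builds a boolean array of shape (len(df), len(info)).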

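To create the groupnum column, you can use ufunc.outer with greater_equal and less_equal to compare every time in df against every start and stop in info, then use argmax to get the position of the True value in each row. You can then groupby on this column and apply a rolling window of 2 seconds.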
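The outer comparison may be easier to follow on a toy example. The sketch below uses small integers as hypothetical stand-ins for timestamps (an assumption made only to keep the arrays readable) and shows why rows outside every range need to be filtered with `any`: `argmax` returns 0 for a row that is all False.

```python
import numpy as np

# Hypothetical integer "times" standing in for timestamps
times = np.array([0, 3, 7, 8, 11, 20])
starts = np.array([0, 8])
stops = np.array([7, 11])

# in_range[i, j] is True when starts[j] <= times[i] <= stops[j]
in_range = (np.greater_equal.outer(times, starts)
            & np.less_equal.outer(times, stops))

# position of the first True in each row, i.e. the matching range
groupnum = in_range.argmax(axis=1)

# rows that belong to no range are all False; argmax reports 0 for them,
# so they have to be detected separately
covered = in_range.any(axis=1)

print(groupnum.tolist())  # [0, 0, 0, 1, 1, 0]
print(covered.tolist())   # [True, True, True, True, True, False]
```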

# Date must be datetime64 for the comparisons and the time-based rolling
# (info.Start and info.Stop were already converted in the question's code)
df['Date'] = pd.to_datetime(df['Date'])

# create a boolean array to find which range each row falls in
arr_bool = (np.greater_equal.outer(df.Date.to_numpy(), info.Start.to_numpy())
            & np.less_equal.outer(df.Date.to_numpy(), info.Stop.to_numpy()))

# use argmax to find the position of the first True row-wise
df['groupnum'] = arr_bool.argmax(axis=1)

# select only rows within ranges, use set_index for later rolling and index alignment
df = df.loc[arr_bool.any(axis=1), :].set_index('Date')

# groupby groupnum, do the sum for a closed interval of 2s
df['cum'] = (df.groupby('groupnum').rolling('2s', closed='both').A.sum()
             .reset_index(level=0, drop=True))  # for index alignment

df = df.reset_index()  # get back date as a column
print(df)
                  Date   A  groupnum   cum
0  2014-11-21 11:00:00   1         0   1.0
1  2014-11-21 11:00:03   2         0   2.0
2  2014-11-21 11:00:04   5         0   7.0
3  2014-11-21 11:00:05   3         0  10.0
4  2014-11-21 11:00:07   9         0  12.0
5  2014-11-21 11:00:08   6         1   6.0
6  2014-11-21 11:00:10   3         1   9.0
7  2014-11-21 11:00:11   0         1   3.0
8  2014-10-24 10:00:55   8         2   8.0
9  2014-10-24 10:00:59  10         2  10.0

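Edit: if arr_bool cannot be created this way, you can try iterating over the rows of info and checking independently whether each date is at or above start and at or below stop: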

# get once an array of all dates (should be faster)
arr_date = df.Date.to_numpy()

# create groups by sum
df['groupnum'] = np.sum([i* (np.greater_equal(arr_date, start)&np.less_equal(arr_date, stop))
for i, (start, stop) in enumerate(zip(info.Start.to_numpy(), info.Stop.to_numpy()), 1)], axis=0) - 1

# remove the rows that are not in any range
df = df.loc[df['groupnum'].ge(0), :].set_index('Date')

# then same for the column cum
df['cum'] = ...
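For reference, here is one way the loop-based variant might look end to end on the question's data. It is a sketch under a few assumptions: the ranges in info do not overlap, the loop assigns the group number directly instead of summing masks as in the snippet above, and the cum step reuses the rolling call from the first snippet.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date": ["2014-11-21 11:00:00", "2014-11-21 11:00:03", "2014-11-21 11:00:04",
             "2014-11-21 11:00:05", "2014-11-21 11:00:07", "2014-11-21 11:00:08",
             "2014-11-21 11:00:10", "2014-11-21 11:00:11", "2014-10-24 10:00:55",
             "2014-10-24 10:00:59"],
    "A": [1, 2, 5, 3, 9, 6, 3, 0, 8, 10],
})
info = pd.DataFrame({
    "Start": ["2014-11-21 11:00:00", "2014-11-21 11:00:08", "2014-10-24 10:00:55"],
    "Stop": ["2014-11-21 11:00:07", "2014-11-21 11:00:11", "2014-10-24 10:00:59"],
})
df["Date"] = pd.to_datetime(df["Date"])
info["Start"] = pd.to_datetime(info["Start"])
info["Stop"] = pd.to_datetime(info["Stop"])

# one 1-D mask per range instead of a full len(df) x len(info) array
arr_date = df["Date"].to_numpy()
groupnum = np.full(len(df), -1)  # -1 marks "not in any range"
for i, (start, stop) in enumerate(zip(info["Start"].to_numpy(),
                                      info["Stop"].to_numpy())):
    groupnum[(arr_date >= start) & (arr_date <= stop)] = i
df["groupnum"] = groupnum

# drop rows outside every range, then index by date for the time-based window
df = df.loc[df["groupnum"].ge(0)].set_index("Date")

# 2-second closed window within each group, realigned on the Date index
df["cum"] = (df.groupby("groupnum")["A"].rolling("2s", closed="both").sum()
             .reset_index(level=0, drop=True))

df = df.reset_index()
print(df)
```

Memory use here grows with len(df) per iteration rather than with len(df) * len(info), at the cost of a Python-level loop over the ranges.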

Regarding python - Return a 2-second cumulative sum using a dataset of ranges, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59734481/
