I have a df with StartDate and EndDate columns:
df.loc[:,['StartDate','EndDate']].head()
Out[92]:
StartDate EndDate
0 2016-05-19 14:19:14.820002 2016-05-19 14:19:17.899999
1 2016-05-19 14:19:32.119999 2016-05-19 14:19:37.020002
I want a df2 at an arbitrary frequency in which each bin holds the amount of time that falls inside the (StartDate, EndDate) intervals. For example:
df2 ('1s')
2016-05-19 14:19:14.000000 0.179998
2016-05-19 14:19:15.000000 1
2016-05-19 14:19:16.000000 1
2016-05-19 14:19:17.000000 0.89999
2016-05-19 14:19:18.000000 0
Of course, groupby(StartDate.date.dt)['Duration'], where 'Duration' is 'EndDate' - 'StartDate', does not work.
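To see why the grouping fails, here is a minimal sketch. It assumes the attempt meant grouping by the second in which each interval starts; the `naive` name and the use of `dt.floor` are illustrative, not taken from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'StartDate': pd.to_datetime(['2016-05-19 14:19:14.820002',
                                 '2016-05-19 14:19:32.119999']),
    'EndDate': pd.to_datetime(['2016-05-19 14:19:17.899999',
                               '2016-05-19 14:19:37.020002']),
})
df['Duration'] = (df['EndDate'] - df['StartDate']).dt.total_seconds()

# Grouping by the starting second puts each interval's whole duration
# into one bin instead of spreading it over the seconds it spans
naive = df.groupby(df['StartDate'].dt.floor('1s'))['Duration'].sum()
print(naive)
```

The result has one row per interval, each holding more than one second of duration, so it cannot be the per-second breakdown asked for.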
Best answer
import numpy as np
import pandas as pd

df = pd.DataFrame({'StartDate':['2016-05-19 14:19:14.820002','2016-05-19 14:19:32.119999', '2016-05-19 14:19:17.899999'],
'EndDate':['2016-05-19 14:19:17.899999', '2016-05-19 14:19:37.020002', '2016-05-19 14:19:18.5']})

# Stack both columns into a single 'date' column; +1 marks a start, -1 an end
df2 = pd.melt(df, var_name='type', value_name='date')
df2['date'] = pd.to_datetime(df2['date'])
df2['sign'] = np.where(df2['type']=='StartDate', 1, -1)

# Every whole second in the range, plus every start/end timestamp
min_date = df2['date'].min().to_period('1s').to_timestamp()
max_date = (df2['date'].max() + pd.Timedelta('1s')).to_period('1s').to_timestamp()  # not used below
index = pd.date_range(min_date, df2['date'].max(), freq='1s').union(df2['date'])

# Two events share a timestamp here, so sum their signs instead of using set_index
df2 = df2.groupby('date')[['sign']].sum()
df2 = df2.reindex(index)

# weight = number of intervals open from this row until the next one
df2['weight'] = df2['sign'].fillna(0).cumsum()

# Seconds to the next row, counted only while an interval is open
df2['duration'] = 0.0
df2.iloc[:-1, df2.columns.get_loc('duration')] = (df2.index[1:] - df2.index[:-1]).total_seconds()
df2['duration'] = df2['duration'] * df2['weight']

df2 = df2.resample('1s').sum()
print(df2)
yields
sign weight duration
2016-05-19 14:19:14 1.0 1.0 0.179998
2016-05-19 14:19:15 0.0 1.0 1.000000
2016-05-19 14:19:16 0.0 1.0 1.000000
2016-05-19 14:19:17 0.0 3.0 1.000000
2016-05-19 14:19:18 -1.0 1.0 0.500000
2016-05-19 14:19:19 0.0 0.0 0.000000
2016-05-19 14:19:20 0.0 0.0 0.000000
2016-05-19 14:19:21 0.0 0.0 0.000000
2016-05-19 14:19:22 0.0 0.0 0.000000
2016-05-19 14:19:23 0.0 0.0 0.000000
2016-05-19 14:19:24 0.0 0.0 0.000000
2016-05-19 14:19:25 0.0 0.0 0.000000
2016-05-19 14:19:26 0.0 0.0 0.000000
2016-05-19 14:19:27 0.0 0.0 0.000000
2016-05-19 14:19:28 0.0 0.0 0.000000
2016-05-19 14:19:29 0.0 0.0 0.000000
2016-05-19 14:19:30 0.0 0.0 0.000000
2016-05-19 14:19:31 0.0 0.0 0.000000
2016-05-19 14:19:32 1.0 1.0 0.880001
2016-05-19 14:19:33 0.0 1.0 1.000000
2016-05-19 14:19:34 0.0 1.0 1.000000
2016-05-19 14:19:35 0.0 1.0 1.000000
2016-05-19 14:19:36 0.0 1.0 1.000000
2016-05-19 14:19:37 -1.0 1.0 0.020002
The main idea is to put StartDate and EndDate in a single column, and assign +1 to each StartDate and -1 to each EndDate:
df2 = pd.melt(df, var_name='type', value_name='date')
df2['date'] = pd.to_datetime(df2['date'])
df2['sign'] = np.where(df2['type']=='StartDate', 1, -1)
# type date sign
# 0 StartDate 2016-05-19 14:19:14.820002 1
# 1 StartDate 2016-05-19 14:19:32.119999 1
# 2 EndDate 2016-05-19 14:19:17.899999 -1
# 3 EndDate 2016-05-19 14:19:37.020002 -1
Now set date as the index, then reindex the DataFrame so that it includes every timestamp at 1-second frequency:
min_date = df2['date'].min().to_period('1s').to_timestamp()
max_date = (df2['date'].max() + pd.Timedelta('1s')).to_period('1s').to_timestamp()
index = pd.date_range(min_date, df2['date'].max(), freq='1s').union(df2['date'])
df2 = df2.set_index('date')
df2 = df2.reindex(index)
# type sign
# 2016-05-19 14:19:14.000000 NaN NaN
# 2016-05-19 14:19:14.820002 StartDate 1.0
# 2016-05-19 14:19:15.000000 NaN NaN
# 2016-05-19 14:19:16.000000 NaN NaN
# 2016-05-19 14:19:17.000000 NaN NaN
# 2016-05-19 14:19:17.899999 EndDate -1.0
# 2016-05-19 14:19:18.000000 NaN NaN
# ...
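The `to_period('1s').to_timestamp()` round trip used for min_date truncates a timestamp to the start of its second. A minimal sketch, assuming a reasonably recent pandas, comparing it with `Timestamp.floor`, which is swapped in here as a one-call alternative (the variable names are illustrative):

```python
import pandas as pd

ts = pd.Timestamp('2016-05-19 14:19:14.820002')

# Round trip through a 1-second Period, as in the step above
via_period = ts.to_period('1s').to_timestamp()

# Direct truncation to the start of the second
via_floor = ts.floor('1s')

print(via_period, via_floor)
```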
In the sign column, fill the NaN values with 0 and compute the cumulative sum:
df2['weight'] = df2['sign'].fillna(0).cumsum()
# type sign weight
# 2016-05-19 14:19:14.000000 NaN NaN 0.0
# 2016-05-19 14:19:14.820002 StartDate 1.0 1.0
# 2016-05-19 14:19:15.000000 NaN NaN 1.0
# 2016-05-19 14:19:16.000000 NaN NaN 1.0
# 2016-05-19 14:19:17.000000 NaN NaN 1.0
# 2016-05-19 14:19:17.899999 EndDate -1.0 0.0
# 2016-05-19 14:19:18.000000 NaN NaN 0.0
# ...
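The cumulative sum works because each +1 opens an interval and each -1 closes one, so the running total is the number of intervals open at that point. A small sketch with made-up integer times (the `events` data is hypothetical) showing two intervals that overlap:

```python
import pandas as pd

# Hypothetical events: one interval spans t=0..3, another spans t=2..5
events = pd.Series([1, 1, -1, -1], index=[0, 2, 3, 5])

# Running total = number of intervals open from each time onward
open_count = events.cumsum()
print(open_count.tolist())  # [1, 2, 1, 0]
```

Between t=2 and t=3 the count is 2, so that stretch of time is counted twice once it is multiplied by the weight.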
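Compute the time from each row to the next, then multiply it by weight so that only time inside an interval counts:
df2['duration'] = 0.0  # float, so the fractional seconds assigned below fit the column dtype
df2.iloc[:-1, df2.columns.get_loc('duration')] = (df2.index[1:] - df2.index[:-1]).total_seconds()
df2['duration'] = df2['duration'] * df2['weight']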
# type sign weight duration
# 2016-05-19 14:19:14.000000 NaN NaN 0.0 0.000000
# 2016-05-19 14:19:14.820002 StartDate 1.0 1.0 0.179998
# 2016-05-19 14:19:15.000000 NaN NaN 1.0 1.000000
# 2016-05-19 14:19:16.000000 NaN NaN 1.0 1.000000
# 2016-05-19 14:19:17.000000 NaN NaN 1.0 0.899999
# 2016-05-19 14:19:17.899999 EndDate -1.0 0.0 0.000000
# 2016-05-19 14:19:18.000000 NaN NaN 0.0 0.000000
Finally, resample the DataFrame to 1-second frequency:
df2 = df2.resample('1s').sum(numeric_only=True)  # skip the string 'type' column
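Since the question asks for an arbitrary frequency, the steps above could be wrapped in a function. This is a sketch under stated assumptions rather than the answer's own code: `interval_coverage` and its parameters are hypothetical names, `freq` is assumed to be a fixed frequency such as '1s' or '500ms', every EndDate is assumed to be no earlier than its StartDate, and overlapping intervals are counted once per interval, as in the weight column above.

```python
import numpy as np
import pandas as pd


def interval_coverage(df, freq='1s', start_col='StartDate', end_col='EndDate'):
    """Seconds covered by the (start, end) intervals in each `freq` bin."""
    # One entry per event: +1 where an interval opens, -1 where it closes
    events = pd.concat([
        pd.Series(1, index=pd.DatetimeIndex(pd.to_datetime(df[start_col]))),
        pd.Series(-1, index=pd.DatetimeIndex(pd.to_datetime(df[end_col]))),
    ])
    # Net change at each distinct timestamp
    events = events.groupby(level=0).sum()

    # Bin edges at the requested frequency, merged with the event timestamps
    edges = pd.date_range(events.index.min().floor(freq),
                          events.index.max().ceil(freq), freq=freq)
    index = edges.union(events.index)

    # Number of intervals open from each timestamp until the next one
    weight = events.reindex(index, fill_value=0).cumsum()

    # Seconds to the next timestamp (0 for the last row), scaled by the weight
    gaps = np.append(np.diff(index.values) / np.timedelta64(1, 's'), 0.0)
    covered = pd.Series(gaps * weight.to_numpy(), index=index)

    return covered.resample(freq).sum()


df = pd.DataFrame({
    'StartDate': ['2016-05-19 14:19:14.820002', '2016-05-19 14:19:32.119999'],
    'EndDate': ['2016-05-19 14:19:17.899999', '2016-05-19 14:19:37.020002'],
})
result = interval_coverage(df, '1s')
print(result.head())
```

A quick sanity check is that `result.sum()` should match the total length of all intervals, here about 7.98 seconds.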
I learned this trick from DSM.
This answer comes from a similar question on Stack Overflow, about python: how to resample rather than group by intervals in pandas: https://stackoverflow.com/questions/54961267/