
python - Filling NA values in a pandas Series with a stop

Reposted. Author: 太空狗. Updated: 2023-10-30 02:46:02

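I am analyzing a time series, and based on certain criteria I can pick out the rows where an event starts and ends. At this point my series looks like this (I have omitted some repeated values for brevity):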

Setup

import numpy as np
import pandas
from pandas import Timestamp

datadict = {'event': {
Timestamp('2010-01-01 00:20:00', tz=None): 'event start',
Timestamp('2010-01-01 00:30:00', tz=None): '--',
Timestamp('2010-01-01 00:40:00', tz=None): '--',
Timestamp('2010-01-01 00:50:00', tz=None): '--',
Timestamp('2010-01-01 01:00:00', tz=None): '--',
Timestamp('2010-01-01 01:10:00', tz=None): 'event end',
Timestamp('2010-01-01 01:20:00', tz=None): '--',
Timestamp('2010-01-01 02:20:00', tz=None): '--',
Timestamp('2010-01-01 02:30:00', tz=None): 'event start',
Timestamp('2010-01-01 02:40:00', tz=None): '--',
Timestamp('2010-01-01 02:50:00', tz=None): '--',
Timestamp('2010-01-01 03:00:00', tz=None): '--',
Timestamp('2010-01-01 03:10:00', tz=None): '--',
Timestamp('2010-01-01 03:20:00', tz=None): '--',
Timestamp('2010-01-01 03:30:00', tz=None): 'event end',
}}
data = pandas.DataFrame.from_dict(datadict)

event
2010-01-01 00:20:00 event start
2010-01-01 00:30:00 --
2010-01-01 00:40:00 --
2010-01-01 00:50:00 --
2010-01-01 01:00:00 --
2010-01-01 01:10:00 event end
2010-01-01 01:20:00 --
2010-01-01 02:20:00 --
2010-01-01 02:30:00 event start
2010-01-01 02:40:00 --
2010-01-01 02:50:00 --
2010-01-01 03:00:00 --
2010-01-01 03:10:00 --
2010-01-01 03:20:00 --
2010-01-01 03:30:00 event end

This is what I want to achieve (ideally without a for loop):

                           event  event number
2010-01-01 00:20:00 event start 1
2010-01-01 00:30:00 -- 1
2010-01-01 00:40:00 -- 1
2010-01-01 00:50:00 -- 1
2010-01-01 01:00:00 -- 1
2010-01-01 01:10:00 event end 1
2010-01-01 01:20:00 -- NA
2010-01-01 02:20:00 -- NA
2010-01-01 02:30:00 event start 2
2010-01-01 02:40:00 -- 2
2010-01-01 02:50:00 -- 2
2010-01-01 03:00:00 -- 2
2010-01-01 03:10:00 -- 2
2010-01-01 03:20:00 -- 2
2010-01-01 03:30:00 event end 2
2010-01-01 03:40:00 -- NA
2010-01-01 03:50:00 -- NA

What I tried

With some optimistic assumptions about the quality of my data, I can get the event numbers like this:

table = data[data.event != '--'].reset_index()
table['event number'] = 1 + np.floor(table.index / 2)
table = table.set_index('index')

event event number
index
2010-01-01 00:20:00 event start 1
2010-01-01 01:10:00 event end 1
2010-01-01 02:30:00 event start 2
2010-01-01 03:30:00 event end 2
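
The "optimistic assumptions" above presumably amount to the markers strictly alternating start, end, start, end. A minimal sketch of how that could be checked before numbering; the helper name `markers_alternate` and the sample Series are hypothetical and not part of the original question:

```python
import pandas as pd

def markers_alternate(events: pd.Series) -> bool:
    """Return True if the non-'--' markers go start, end, start, end, ..."""
    markers = events[events != '--'].tolist()
    # Build the pattern we expect for this many markers.
    expected = ['event start', 'event end'] * (len(markers) // 2)
    # An odd number of markers can never be well-formed pairs.
    return len(markers) % 2 == 0 and markers == expected

# Hypothetical sample data, not taken from the question.
good = pd.Series(['event start', '--', 'event end', '--', 'event start', 'event end'])
bad = pd.Series(['event start', '--', 'event start', 'event end'])

print(markers_alternate(good))  # True
print(markers_alternate(bad))   # False
```

If the check fails, dividing the marker position by two would silently assign wrong event numbers, so it may be worth running first.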

I can then join this back to my original DataFrame and call fillna with method='ffill':

data2 = data.join(table[['event number']])
data2['filled'] = data2['event number'].fillna(method='ffill')

event event number filled
2010-01-01 00:20:00 event start 1 1
2010-01-01 00:30:00 -- NaN 1
2010-01-01 00:40:00 -- NaN 1
2010-01-01 00:50:00 -- NaN 1
2010-01-01 01:00:00 -- NaN 1
2010-01-01 01:10:00 event end 1 1
2010-01-01 01:20:00 -- NaN 1 # <- d'oh
2010-01-01 02:20:00 -- NaN 1 # <- d'oh
2010-01-01 02:30:00 event start 2 2
2010-01-01 02:40:00 -- NaN 2
2010-01-01 02:50:00 -- NaN 2
2010-01-01 03:00:00 -- NaN 2
2010-01-01 03:10:00 -- NaN 2
2010-01-01 03:20:00 -- NaN 2
2010-01-01 03:30:00 event end 2 2
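
The overshoot can be reproduced on a small Series. This is only a sketch with made-up values; it uses `Series.ffill()`, which is equivalent to `fillna(method='ffill')` (recent pandas versions emit a deprecation warning for the latter spelling):

```python
import numpy as np
import pandas as pd

# Made-up event numbers: 1 at the start and end of event 1,
# two gap rows, then 2 at the start and end of event 2.
numbers = pd.Series([1, np.nan, 1, np.nan, np.nan, 2, np.nan, 2])

# Forward fill has no notion of where an event stops,
# so the two gap rows (positions 3 and 4) inherit event 1.
filled = numbers.ffill()
print(filled.tolist())  # [1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0]
```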

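The problem

As you can see, the time between events (01:20 to 02:20) is attributed to event #1.

The question

Is it possible to skip these stretches without looping?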

Best answer

You can do this by looking at the cumulative sums of the number of event start rows and the number of event end rows:

>>> data['event number'] = (data.event == 'event start').cumsum()
>>> data
event event number
2010-01-01 00:20:00 event start 1
2010-01-01 00:30:00 -- 1
2010-01-01 00:40:00 -- 1
2010-01-01 00:50:00 -- 1
2010-01-01 01:00:00 -- 1
2010-01-01 01:10:00 event end 1
2010-01-01 01:20:00 -- 1
2010-01-01 02:20:00 -- 1
2010-01-01 02:30:00 event start 2
2010-01-01 02:40:00 -- 2
2010-01-01 02:50:00 -- 2
2010-01-01 03:00:00 -- 2
2010-01-01 03:10:00 -- 2
2010-01-01 03:20:00 -- 2
2010-01-01 03:30:00 event end 2
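
To see why the cumulative sum works as a counter, it may help to look at it in isolation. In this sketch the Series values are invented for illustration; comparing against 'event start' gives booleans, and `cumsum` treats True as 1 and False as 0:

```python
import pandas as pd

# Invented sample data with one leading row before the first event.
events = pd.Series(['--', 'event start', '--', 'event end',
                    '--', 'event start', 'event end'])

is_start = events == 'event start'   # boolean mask, True on start rows
counter = is_start.cumsum()          # running count of starts seen so far
print(counter.tolist())  # [0, 1, 1, 1, 1, 2, 2]
```

The counter only ever increases at a start row, which is why rows after an event end still carry the previous event's number at this stage.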

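Now you only need to set the value to NaN where no event is in progress. Those are exactly the rows where the cumulative sum of event start equals the cumulative sum of event end (shifted by 1 row, so that the event end row itself still belongs to the event):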

>>> idx = data['event number'] == (data.event.shift(1) == 'event end').cumsum()
>>> data.loc[idx, 'event number'] = np.nan
>>> data
event event number
2010-01-01 00:20:00 event start 1
2010-01-01 00:30:00 -- 1
2010-01-01 00:40:00 -- 1
2010-01-01 00:50:00 -- 1
2010-01-01 01:00:00 -- 1
2010-01-01 01:10:00 event end 1
2010-01-01 01:20:00 -- NaN
2010-01-01 02:20:00 -- NaN
2010-01-01 02:30:00 event start 2
2010-01-01 02:40:00 -- 2
2010-01-01 02:50:00 -- 2
2010-01-01 03:00:00 -- 2
2010-01-01 03:10:00 -- 2
2010-01-01 03:20:00 -- 2
2010-01-01 03:30:00 event end 2

[15 rows x 2 columns]
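
The two steps of the answer can also be wrapped into one function that returns a new Series instead of modifying the DataFrame in place. This is a sketch under the same assumption as the answer (markers alternate start, end); the function name `number_events` and the sample data are hypothetical, and `Series.where` is used here in place of the `.loc` assignment:

```python
import pandas as pd

def number_events(events: pd.Series) -> pd.Series:
    """Number each event; rows outside any event become NaN."""
    # Running count of event starts seen so far.
    starts = (events == 'event start').cumsum()
    # Running count of event ends seen strictly before the current row.
    ends_before = (events.shift(1) == 'event end').cumsum()
    # Keep the counter only where an event is still open.
    return starts.where(starts != ends_before)

# Hypothetical sample with rows before, between, and after events.
events = pd.Series(
    ['--', 'event start', '--', 'event end', '--', '--',
     'event start', 'event end', '--'],
    index=pd.date_range('2010-01-01 00:10', periods=9, freq='10min'),
)
numbered = number_events(events)
print(numbered)
```

Rows before the first event start and after the last event end come out as NaN as well, because both counters are equal there.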

Regarding python - filling NA values in a pandas Series with a stop, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/22290793/
