
python - Calculating event intervals in a pandas DataFrame

Reposted. Author: 行者123. Updated: 2023-12-05 03:23:17

Given the following pandas DataFrame in Python:

date                       ID_bulb  switch  using_time         error
2022-02-05 14:30:21+00:00  5        OFF     NaT                INIT
2022-02-27 15:30:21+00:00  5        ON      NaT                ERROR2
2022-02-27 17:05:21+00:00  5        OFF     0 days 01:35:00    OK
2022-04-07 15:30:21+00:00  5        ON      NaT                OK
2022-04-07 15:30:21+00:00  5        OFF     0 days 00:00:00    OK
2022-04-07 17:05:21+00:00  5        OFF     NaT                ERROR2
2022-04-06 15:30:21+00:00  4        ON      NaT                INIT
2022-04-06 15:35:21+00:00  4        OFF     NaT                ERROR1
2022-04-06 16:10:21+00:00  4        ON      NaT                ERROR2
2022-04-07 15:30:21+00:00  4        OFF     0 days 23:20:00    OK
2022-04-07 17:05:21+00:00  4        ON      NaT                ERROR2
2022-01-01 19:40:21+00:00  3        ON      NaT                INIT
2022-02-03 22:40:21+00:00  3        ON      NaT                ERROR2
2022-02-03 23:20:21+00:00  3        OFF     0 days 00:40:00    OK
2022-02-04 00:20:21+00:00  3        ON      NaT                OK
2022-02-04 14:30:21+00:00  3        ON      NaT                ERROR2
2022-02-04 15:30:21+00:00  3        ON      NaT                ERROR2
2022-02-04 15:35:21+00:00  3        OFF     0 days 00:05:00    OK
2022-02-04 15:40:21+00:00  3        OFF     NaT                ERROR2
2022-02-04 19:40:21+00:00  3        ON      NaT                OK
2022-02-06 15:35:21+00:00  3        OFF     1 days 19:55:00    OK
2022-02-28 18:40:21+00:00  3        ON      NaT                ERROR1
2022-10-12 18:40:21+00:00  3        OFF     226 days 00:00:00  OK
2022-02-04 09:10:21+00:00  2        ON      NaT                OK
2022-02-04 14:10:21+00:00  2        ON      NaT                ERROR2

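Problem to solve: I want to add a new column named cost_days. It should be filled in only for rows where using_time is not NaT, and it should hold the number of times the bulb was on for at least n consecutive hours within the night-time window that runs from start_time to end_time.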
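To make the target quantity concrete, here is a minimal sketch (an illustration only, not part of the original post) that counts how many hourly samples of a single ON-to-OFF interval fall inside the night window. The endpoints are taken from the bulb 4 row whose using_time is 0 days 23:20:00; the variable names, the hourly sampling, and the 9-hour window length are assumptions made for the example.

```python
import pandas as pd

# Night window from the question; 22:00 -> 07:00 spans 9 hours (assumed here).
start_time, end_time = "22:00", "07:00"
window_hours = 9

# One interval from the sample data (bulb 4, switched ON and later OFF).
on_at = pd.Timestamp("2022-04-06 16:10:21+00:00")
off_at = pd.Timestamp("2022-04-07 15:30:21+00:00")

# Sample the interval once per hour and keep the samples inside the window.
samples = pd.date_range(on_at, off_at, freq="1h")
night_samples = pd.Series(samples, index=samples).between_time(start_time, end_time)

night_hours = len(night_samples)
full_windows = night_hours // window_hours
print(night_hours, full_windows)  # 9 1
```

The single complete window agrees with the value 1 that the expected result lists for this row.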

Proposed solution, based on @keramat's idea.

from datetime import datetime
import numpy as np
import pandas as pd

def rounder(x):
    # Fixed parameters: at least 5 hours within the interval from 22:00 to 07:00
    n = 5
    start_date = "22:00"
    end_date = "07:00"
    # assert (n+1) < time_slot
    time_1 = datetime.strptime(start_date, "%H:%M")
    time_2 = datetime.strptime(end_date, "%H:%M")
    # Length of the night window in whole hours (9 for 22:00 to 07:00)
    time_slot = (time_2 - time_1).seconds // 3600
    # Hourly timestamps between the last two dates of the group
    v = pd.date_range(list(x)[-2], list(x)[-1], freq='1h')
    temp = pd.Series(v, index=v).between_time(start_date, end_date)
    temp = len(temp) / time_slot
    return np.floor(temp) if np.mod(temp, 1.0) < (n+1)/time_slot else np.ceil(temp)/time_slot

g = (df['using_time'].notna()).sort_index(ascending=False).cumsum()
g = (g-max(g)).abs()

temp = df.groupby(g)['date'].apply(lambda x: rounder(x))

#Up to this point, it runs perfectly.


df.loc[df[df['using_time'].notna()].index, 'cost_days']=temp.values

# ValueError: shape mismatch: value array of shape (8,) could not be broadcast to indexing result of shape (7,)

df['cost_days'] = df['cost_days'].fillna(0)

print(df)

I need to fix the error to get the result I want.

The resulting DataFrame should look like this:

date                       ID_bulb  switch  using_time         error   cost_days
2022-02-05 14:30:21+00:00  5        OFF     NaT                INIT    0
2022-02-27 15:30:21+00:00  5        ON      NaT                ERROR2  0
2022-02-27 17:05:21+00:00  5        OFF     0 days 01:35:00    OK      0
2022-04-07 15:30:21+00:00  5        ON      NaT                OK      0
2022-04-07 15:30:21+00:00  5        OFF     0 days 00:00:00    OK      0
2022-04-07 17:05:21+00:00  5        OFF     NaT                ERROR2  0
2022-04-06 15:30:21+00:00  4        ON      NaT                INIT    0
2022-04-06 15:35:21+00:00  4        OFF     NaT                ERROR1  0
2022-04-06 16:10:21+00:00  4        ON      NaT                ERROR2  0
2022-04-07 15:30:21+00:00  4        OFF     0 days 23:20:00    OK      1
2022-04-07 17:05:21+00:00  4        ON      NaT                ERROR2  0
2022-01-01 19:40:21+00:00  3        ON      NaT                INIT    0
2022-02-03 22:40:21+00:00  3        ON      NaT                ERROR2  0
2022-02-03 23:20:21+00:00  3        OFF     0 days 00:40:00    OK      0
2022-02-04 00:20:21+00:00  3        ON      NaT                OK      0
2022-02-04 14:30:21+00:00  3        ON      NaT                ERROR2  0
2022-02-04 15:30:21+00:00  3        ON      NaT                ERROR2  0
2022-02-04 15:35:21+00:00  3        OFF     0 days 00:05:00    OK      0
2022-02-04 15:40:21+00:00  3        OFF     NaT                ERROR2  0
2022-02-04 19:40:21+00:00  3        ON      NaT                OK      0
2022-02-06 15:35:21+00:00  3        OFF     1 days 19:55:00    OK      2
2022-02-28 18:40:21+00:00  3        ON      NaT                ERROR1  0
2022-10-12 18:40:21+00:00  3        OFF     226 days 00:00:00  OK      226
2022-02-04 09:10:21+00:00  2        ON      NaT                OK      0
2022-02-04 14:10:21+00:00  2        ON      NaT                ERROR2  0

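Edit: I think the problem is that the input dataset must end with a non-null using_time value for the code above to work. How can I fix this and get the result I want?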
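The suspicion in the edit can be checked on a toy column. The sketch below is an illustration with made-up values, not the original data. It builds the same reversed cumulative-sum grouping key and compares the number of groups with the number of non-null rows. When the column ends with NaT, the trailing rows form one extra group that has no row to receive a value.

```python
import pandas as pd

# Hypothetical toy column: two completed intervals, then a trailing NaT row.
using_time = pd.Series([
    pd.NaT,
    pd.Timedelta("01:35:00"),
    pd.NaT,
    pd.Timedelta("00:05:00"),
    pd.NaT,
])

# Same grouping key as in the question: count non-null rows from the bottom up,
# then flip the labels so that the first group is numbered 0.
g = using_time.notna().sort_index(ascending=False).cumsum()
g = (g - g.max()).abs()

n_groups = g.nunique()                     # groups produced by the key
n_targets = int(using_time.notna().sum())  # rows that should receive a value
print(n_groups, n_targets)  # 3 2
```

On the full sample data the same effect gives 8 groups for 7 non-null rows, which matches the shapes reported in the ValueError.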

Best answer

Use:

if df['using_time'].iloc[-1] == 'NaT':
    # Drop the trailing group: it has no non-NaT row to receive a value
    g = g[g != g.max()]

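Note that when I built df from your string, using_time was of string type. In your case the column probably holds real timedelta values, so you may need to do the following instead: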

if pd.isna(df['using_time'].iloc[-1]):
    g = g[g != g.max()]
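Which of the two checks applies depends on the dtype of the column. The following sketch uses hypothetical two-row Series to show the difference: a string comparison works for a text column, while a timedelta column needs pd.isna, because a scalar NaT has no .isna() method and does not compare equal to the string 'NaT'.

```python
import pandas as pd

# The same two values, once as text and once as real timedeltas.
as_text = pd.Series(["0 days 01:35:00", "NaT"])
as_timedelta = pd.to_timedelta(as_text)

# Text column: the missing marker is just the string 'NaT'.
text_last_missing = as_text.iloc[-1] == "NaT"

# Timedelta column: use the top-level pd.isna on the scalar.
td_last_missing = bool(pd.isna(as_timedelta.iloc[-1]))

# Comparing a real NaT with the string 'NaT' is not a reliable test.
td_equals_string = bool(as_timedelta.iloc[-1] == "NaT")

print(text_last_missing, td_last_missing, td_equals_string)  # True True False
```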

Demo:

string = """date   ID_bulb   switch   using_time   error
2022-02-05 14:30:21+00:00 5 OFF NaT INIT
2022-02-27 15:30:21+00:00 5 ON NaT ERROR2
2022-02-27 17:05:21+00:00 5 OFF 0 days 01:35:00 OK
2022-04-07 15:30:21+00:00 5 ON NaT OK
2022-04-07 15:30:21+00:00 5 OFF 0 days 00:00:00 OK
2022-04-07 17:05:21+00:00 5 OFF NaT ERROR2
2022-04-06 15:30:21+00:00 4 ON NaT INIT
2022-04-06 15:35:21+00:00 4 OFF NaT ERROR1
2022-04-06 16:10:21+00:00 4 ON NaT ERROR2
2022-04-07 15:30:21+00:00 4 OFF 0 days 23:20:00 OK
2022-04-07 17:05:21+00:00 4 ON NaT ERROR2
2022-01-01 19:40:21+00:00 3 ON NaT INIT
2022-02-03 22:40:21+00:00 3 ON NaT ERROR2
2022-02-03 23:20:21+00:00 3 OFF 0 days 00:40:00 OK
2022-02-04 00:20:21+00:00 3 ON NaT OK
2022-02-04 14:30:21+00:00 3 ON NaT ERROR2
2022-02-04 15:30:21+00:00 3 ON NaT ERROR2
2022-02-04 15:35:21+00:00 3 OFF 0 days 00:05:00 OK
2022-02-04 15:40:21+00:00 3 OFF NaT ERROR2
2022-02-04 19:40:21+00:00 3 ON NaT OK
2022-02-06 15:35:21+00:00 3 OFF 1 days 19:55:00 OK
2022-02-28 18:40:21+00:00 3 ON NaT ERROR1
2022-10-12 18:40:21+00:00 3 OFF 226 days 00:00:00 OK
2022-02-04 09:10:21+00:00 2 ON NaT OK
2022-02-04 14:10:21+00:00 2 ON NaT ERROR2"""

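from datetime import datetime
import numpy as np
import pandas as pd

# The date and using_time fields contain spaces themselves, so rebuild each row from its tokens
rows = [x.split() for x in string.split('\n')]
data = [[' '.join(t[:2]), t[2], t[3], ' '.join(t[4:-1]), t[-1]] for t in rows[1:]]
df = pd.DataFrame(data, columns=rows[0])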

def rounder(x):
    # Fixed parameters: at least 5 hours within the interval from 22:00 to 07:00
    n = 5
    start_date = "22:00"
    end_date = "07:00"
    # assert (n+1) < time_slot
    time_1 = datetime.strptime(start_date, "%H:%M")
    time_2 = datetime.strptime(end_date, "%H:%M")
    time_slot = (time_2 - time_1).seconds // 3600
    v = pd.date_range(list(x)[-2], list(x)[-1], freq='1h')
    temp = pd.Series(v, index=v).between_time(start_date, end_date)
    temp = len(temp) / time_slot
    return np.floor(temp) if np.mod(temp, 1.0) < (n+1)/time_slot else np.ceil(temp)/time_slot

g = (df['using_time']!='NaT').sort_index(ascending=False).cumsum()
g = (g-max(g)).abs()
if df['using_time'].iloc[-1] == 'NaT':
    # Drop the trailing group: it has no non-NaT row to receive a value
    g = g[g != g.max()]
temp = df.groupby(g)['date'].apply(lambda x: rounder(x))

df.loc[df[df['using_time']!='NaT'].index, 'cost_days'] = temp.values

df['cost_days'] = df['cost_days'].fillna(0)

df

Output:

    date                       ID_bulb  switch  using_time         error   cost_days
0   2022-02-05 14:30:21+00:00  5        OFF     NaT                INIT    0.0
1   2022-02-27 15:30:21+00:00  5        ON      NaT                ERROR2  0.0
2   2022-02-27 17:05:21+00:00  5        OFF     0 days 01:35:00    OK      0.0
3   2022-04-07 15:30:21+00:00  5        ON      NaT                OK      0.0
4   2022-04-07 15:30:21+00:00  5        OFF     0 days 00:00:00    OK      0.0
5   2022-04-07 17:05:21+00:00  5        OFF     NaT                ERROR2  0.0
6   2022-04-06 15:30:21+00:00  4        ON      NaT                INIT    0.0
7   2022-04-06 15:35:21+00:00  4        OFF     NaT                ERROR1  0.0
8   2022-04-06 16:10:21+00:00  4        ON      NaT                ERROR2  0.0
9   2022-04-07 15:30:21+00:00  4        OFF     0 days 23:20:00    OK      1.0
10  2022-04-07 17:05:21+00:00  4        ON      NaT                ERROR2  0.0
11  2022-01-01 19:40:21+00:00  3        ON      NaT                INIT    0.0
12  2022-02-03 22:40:21+00:00  3        ON      NaT                ERROR2  0.0
13  2022-02-03 23:20:21+00:00  3        OFF     0 days 00:40:00    OK      0.0
14  2022-02-04 00:20:21+00:00  3        ON      NaT                OK      0.0
15  2022-02-04 14:30:21+00:00  3        ON      NaT                ERROR2  0.0
16  2022-02-04 15:30:21+00:00  3        ON      NaT                ERROR2  0.0
17  2022-02-04 15:35:21+00:00  3        OFF     0 days 00:05:00    OK      0.0
18  2022-02-04 15:40:21+00:00  3        OFF     NaT                ERROR2  0.0
19  2022-02-04 19:40:21+00:00  3        ON      NaT                OK      0.0
20  2022-02-06 15:35:21+00:00  3        OFF     1 days 19:55:00    OK      2.0
21  2022-02-28 18:40:21+00:00  3        ON      NaT                ERROR1  0.0
22  2022-10-12 18:40:21+00:00  3        OFF     226 days 00:00:00  OK      226.0
23  2022-02-04 09:10:21+00:00  2        ON      NaT                OK      0.0
24  2022-02-04 14:10:21+00:00  2        ON      NaT                ERROR2  0.0
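As an aside, rounder only reads the last two dates of each group, that is, the row with a non-null using_time and the row directly before it. Under that observation, the grouping key can be avoided altogether by pairing each target row with the previous row via shift. The sketch below is an alternative and not the method from the answer. The helper full_nights, its parameters, and the small sample frame are hypothetical, and it deliberately counts only complete night windows rather than reproducing the rounding rule in rounder. It also assumes the first row is never a target row, since that row has no predecessor.

```python
import pandas as pd

def full_nights(start, end, window=("22:00", "07:00"), slot_hours=9):
    """Count complete night windows covered by hourly samples of [start, end]."""
    v = pd.date_range(start, end, freq="1h")
    hits = len(pd.Series(v, index=v).between_time(*window))
    return hits // slot_hours

# A few rows from the sample data, ending with a NaT row on purpose.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2022-04-06 15:35:21+00:00", "2022-04-06 16:10:21+00:00",
        "2022-04-07 15:30:21+00:00", "2022-04-07 17:05:21+00:00",
        "2022-02-04 19:40:21+00:00", "2022-02-06 15:35:21+00:00",
        "2022-02-04 09:10:21+00:00",
    ], utc=True),
    "using_time": pd.to_timedelta([
        "NaT", "NaT", "0 days 23:20:00", "NaT",
        "NaT", "1 days 19:55:00", "NaT",
    ]),
})

prev_date = df["date"].shift()    # date of the row directly above
mask = df["using_time"].notna()   # rows that should receive a value

df["cost_days"] = 0
df.loc[mask, "cost_days"] = [
    full_nights(start, end)
    for start, end in zip(prev_date[mask], df.loc[mask, "date"])
]
print(df["cost_days"].tolist())  # [0, 0, 1, 0, 0, 2, 0]
```

Because values are computed only for the masked rows, the number of values always equals the number of target rows, whether or not the frame ends with NaT.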

This article on python - Calculating event intervals in a pandas DataFrame is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/72556634/
