
python - How can I use pandas to identify approximately (threshold-defined) consecutive non-null data?

Reposted. Author: 行者123. Updated: 2023-11-28 21:50:00

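I want to extract rainfall events from a rainfall time series while allowing up to X dry hours (a parameter) inside a single event. By a rainfall event I mean approximately continuous rain (RF > 0) interrupted by at most X consecutive dry hours (RF = 0).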

I would rather not do this with iterators and counters; I am looking for a pandas or numpy/scipy tool that makes it easier.

Here is a sample of my dataframe. RF is the raw rainfall, RFfill is RF.interpolate(), used to fill missing data, and evtId is a column created to store each event's unique ID.

                    TS   RF  RFfill  evtId
0 1997-11-27 14:00:00 0.3 0.3 NaN
1 1997-11-27 15:00:00 1.1 1.1 NaN
2 1997-11-27 16:00:00 0.2 0.2 NaN
3 1997-11-27 17:00:00 0.0 0.0 NaN
4 1997-11-27 18:00:00 0.0 0.0 NaN
5 1997-11-27 19:00:00 1.1 1.1 NaN
6 1997-11-27 20:00:00 0.6 0.6 NaN
7 1997-11-27 21:00:00 0.0 0.0 NaN
8 1997-11-27 22:00:00 0.0 0.0 NaN
9 1997-11-27 23:00:00 0.0 0.0 NaN
10 1997-11-28 00:00:00 0.0 0.0 NaN
11 1997-11-28 01:00:00 0.0 0.0 NaN
12 1997-11-28 02:00:00 0.0 0.0 NaN
13 1997-11-28 03:00:00 0.0 0.0 NaN
14 1997-11-28 04:00:00 0.0 0.0 NaN
15 1997-11-28 05:00:00 0.0 0.0 NaN
16 1997-11-28 06:00:00 0.0 0.0 NaN
17 1997-11-28 07:00:00 0.0 0.0 NaN
18 1997-11-28 08:00:00 0.0 0.0 NaN
19 1997-11-28 09:00:00 0.8 0.8 NaN
20 1997-11-28 10:00:00 1.1 1.1 NaN
21 1997-11-28 11:00:00 2.3 2.3 NaN
22 1997-11-28 12:00:00 1.4 1.4 NaN
23 1997-11-28 13:00:00 0.4 0.4 NaN
24 1997-11-28 14:00:00 0.2 0.2 NaN
25 1997-11-28 15:00:00 0.0 0.0 NaN
26 1997-11-28 16:00:00 0.0 0.0 NaN
27 1997-11-28 17:00:00 0.0 0.0 NaN
28 1997-11-28 18:00:00 0.0 0.0 NaN
29 1997-11-28 19:00:00 0.0 0.0 NaN
30 1997-11-28 20:00:00 0.0 0.0 NaN

Here is the expected output when 5 dry hours are allowed:

                    TS   RF  RFfill  evtId
0 1997-11-27 14:00:00 0.3 0.3 0
1 1997-11-27 15:00:00 1.1 1.1 0
2 1997-11-27 16:00:00 0.2 0.2 0
3 1997-11-27 17:00:00 0.0 0.0 0
4 1997-11-27 18:00:00 0.0 0.0 0
5 1997-11-27 19:00:00 1.1 1.1 0
6 1997-11-27 20:00:00 0.6 0.6 0
7 1997-11-27 21:00:00 0.0 0.0 NaN
8 1997-11-27 22:00:00 0.0 0.0 NaN
9 1997-11-27 23:00:00 0.0 0.0 NaN
10 1997-11-28 00:00:00 0.0 0.0 NaN
11 1997-11-28 01:00:00 0.0 0.0 NaN
12 1997-11-28 02:00:00 0.0 0.0 NaN
13 1997-11-28 03:00:00 0.0 0.0 NaN
14 1997-11-28 04:00:00 0.0 0.0 NaN
15 1997-11-28 05:00:00 0.0 0.0 NaN
16 1997-11-28 06:00:00 0.0 0.0 NaN
17 1997-11-28 07:00:00 0.0 0.0 NaN
18 1997-11-28 08:00:00 0.0 0.0 NaN
19 1997-11-28 09:00:00 0.8 0.8 1
20 1997-11-28 10:00:00 1.1 1.1 1
21 1997-11-28 11:00:00 2.3 2.3 1
22 1997-11-28 12:00:00 1.4 1.4 1
23 1997-11-28 13:00:00 0.4 0.4 1
24 1997-11-28 14:00:00 0.2 0.2 1
25 1997-11-28 15:00:00 0.0 0.0 NaN
26 1997-11-28 16:00:00 0.0 0.0 NaN
27 1997-11-28 17:00:00 0.0 0.0 NaN
28 1997-11-28 18:00:00 0.0 0.0 NaN
29 1997-11-28 19:00:00 0.0 0.0 NaN
30 1997-11-28 20:00:00 0.0 0.0 NaN

Any ideas that would help me achieve this?

Best answer

import numpy as np
import pandas as pd
import scipy.ndimage as ndimage

df = pd.DataFrame({'RF': [ 0.3, 1.1, 0.2, 0. , 0. , 0. , 0. , 0. ,
1.1, 0.6, 0. , 0. , 0. , 0. , 0. , 0. ,
0.8, 1.1, 2.3, 1.4, 0.4, 0.2, 0. , 0. ,
0. , 0. , 0. , 0. ]})

consecutive = 5
mask = df['RF'] > 0
df['mask'] = mask
df['dilation'] = ndimage.binary_dilation(mask, structure=[1]*(consecutive+1))
df['erosion'] = ndimage.binary_erosion(df['dilation'],
structure=[1]*(consecutive+1), border_value=1)
df['labeled'], nobjs = ndimage.label(df['erosion'])
df['evtId'] = np.where(df['labeled'] > 0, df['labeled']-1, np.nan)
print(df[['RF', 'evtId']])

yields

#      RF  evtId
# 0 0.3 0
# 1 1.1 0
# 2 0.2 0
# 3 0.0 0
# 4 0.0 0
# 5 0.0 0
# 6 0.0 0
# 7 0.0 0
# 8 1.1 0
# 9 0.6 0
# 10 0.0 NaN
# 11 0.0 NaN
# 12 0.0 NaN
# 13 0.0 NaN
# 14 0.0 NaN
# 15 0.0 NaN
# 16 0.8 1
# 17 1.1 1
# 18 2.3 1
# 19 1.4 1
# 20 0.4 1
# 21 0.2 1
# 22 0.0 NaN
# 23 0.0 NaN
# 24 0.0 NaN
# 25 0.0 NaN
# 26 0.0 NaN
# 27 0.0 NaN

Explanation: first prepare a boolean mask that is True where df['RF'] > 0:

mask = (df['RF'] > 0)
df['mask'] = mask
# RF mask
# 0 0.3 True
# 1 1.1 True
# 2 0.2 True
# 3 0.0 False
# 4 0.0 False
# 5 0.0 False
# 6 0.0 False
# 7 0.0 False
# 8 1.1 True
# 9 0.6 True
# ...
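One detail worth checking before building the mask is missing data: a comparison such as NaN > 0 evaluates to False, so gaps in RF are treated as dry hours. The question mentions an interpolated column (RFfill = RF.interpolate()); the sketch below, on made-up values, shows how the two masks can differ. Whether interpolated values should count as rain is a modeling choice and not something the original answer addresses.

```python
import numpy as np
import pandas as pd

# Made-up rainfall values with two missing readings.
rf = pd.Series([0.3, np.nan, 0.2, 0.0, np.nan, 1.1])

raw_mask = rf > 0                   # NaN compares as False (treated as dry)
filled_mask = rf.interpolate() > 0  # linear interpolation fills the gaps first

print(raw_mask.tolist())     # [True, False, True, False, False, True]
print(filled_mask.tolist())  # [True, True, True, False, True, True]
```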

Next, dilate the mask, which joins together islands of True (rainy hours) that are separated by 5 or fewer False (dry hours):

df['dilation'] = ndimage.binary_dilation(mask, structure=[1]*(consecutive+1))
# RF mask dilation
# 0 0.3 True True
# 1 1.1 True True
# 2 0.2 True True
# 3 0.0 False True <--,
# 4 0.0 False True |
# 5 0.0 False True | dilation filled in 5 dry hours
# 6 0.0 False True |
# 7 0.0 False True <--'
# 8 1.1 True True
# 9 0.6 True True
# 10 0.0 False True <-- But the `True`s extend a bit too far
# 11 0.0 False True <--
# 12 0.0 False False
# 13 0.0 False True
# 14 0.0 False True
# 15 0.0 False True
# 16 0.8 True True
# 17 1.1 True True
# 18 2.3 True True
# 19 1.4 True True
# 20 0.4 True True
# 21 0.2 True True
# 22 0.0 False True
# 23 0.0 False True
# 24 0.0 False False
# 25 0.0 False False
# 26 0.0 False False
# 27 0.0 False False
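To see why the structure has length consecutive+1, it can help to test the dilation alone on tiny masks with a known gap. The helper gap_is_bridged below is hypothetical and only for illustration: it places two wet steps on either side of a dry gap and reports whether the dilation fills the gap completely.

```python
import numpy as np
import scipy.ndimage as ndimage


def gap_is_bridged(gap, consecutive=5):
    """Hypothetical check: does dilation fill a dry gap of the given length?"""
    mask = np.array([True] * 2 + [False] * gap + [True] * 2)
    dilated = ndimage.binary_dilation(mask, structure=[1] * (consecutive + 1))
    return bool(dilated.all())


for gap in (4, 5, 6):
    print(gap, gap_is_bridged(gap))
# 4 True
# 5 True
# 6 False
```

A structure of 6 ones spreads each True over 5 neighboring cells in total, so a gap of 5 is fully covered while a gap of 6 keeps one False cell, as in row 12 of the table above.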

Next, use binary erosion to remove the Trues that extend too far:

df['erosion'] = ndimage.binary_erosion(df['dilation'], structure=[1]*(consecutive+1), 
border_value=1)
# RF mask dilation erosion
# 0 0.3 True True True
# 1 1.1 True True True
# 2 0.2 True True True
# 3 0.0 False True True
# 4 0.0 False True True
# 5 0.0 False True True
# 6 0.0 False True True
# 7 0.0 False True True
# 8 1.1 True True True
# 9 0.6 True True True
# 10 0.0 False True False <--,
# 11 0.0 False True False |
# 12 0.0 False False False | The Falses have been expanded
# 13 0.0 False True False | (The Trues eroded)
# 14 0.0 False True False |
# 15 0.0 False True False <--'
# 16 0.8 True True True
# 17 1.1 True True True
# 18 2.3 True True True
# 19 1.4 True True True
# 20 0.4 True True True
# 21 0.2 True True True
# 22 0.0 False True False
# 23 0.0 False True False
# 24 0.0 False False False
# 25 0.0 False False False
# 26 0.0 False False False
# 27 0.0 False False False

Now that True marks a rain event, we can use ndimage.label to assign a unique number to each rain event:

df['labeled'], nobjs = ndimage.label(df['erosion'])
# RF mask dilation erosion labeled
# 0 0.3 True True True 1
# 1 1.1 True True True 1
# 2 0.2 True True True 1
# 3 0.0 False True True 1
# 4 0.0 False True True 1
# 5 0.0 False True True 1
# 6 0.0 False True True 1
# 7 0.0 False True True 1
# 8 1.1 True True True 1
# 9 0.6 True True True 1
# 10 0.0 False True False 0
# 11 0.0 False True False 0
# 12 0.0 False False False 0
# 13 0.0 False True False 0
# 14 0.0 False True False 0
# 15 0.0 False True False 0
# 16 0.8 True True True 2
# 17 1.1 True True True 2
# 18 2.3 True True True 2
# 19 1.4 True True True 2
# 20 0.4 True True True 2
# 21 0.2 True True True 2
# 22 0.0 False True False 0
# 23 0.0 False True False 0
# 24 0.0 False False False 0
# 25 0.0 False False False 0
# 26 0.0 False False False 0
# 27 0.0 False False False 0
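For readers unfamiliar with ndimage.label, a minimal sketch on a made-up boolean array may help: each run of adjacent True values gets its own positive integer, False becomes 0, and the number of runs is returned as the second value. ndimage.find_objects then gives the slice covered by each label.

```python
import numpy as np
import scipy.ndimage as ndimage

# Made-up example: three separate runs of True.
events = np.array([True, True, False, False, True, False, True, True])

labeled, nobjs = ndimage.label(events)
print(labeled)  # [1 1 0 0 2 0 3 3]
print(nobjs)    # 3

# One tuple of slices per label, in label order.
for event_slice, in ndimage.find_objects(labeled):
    print(event_slice.start, event_slice.stop)
# 0 2
# 4 5
# 6 8
```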

Finally, use np.where to subtract one from the label number where df['labeled'] > 0 and assign np.nan otherwise:

df['evtId'] = np.where(df['labeled'] > 0, df['labeled']-1, np.nan)
# RF mask dilation erosion labeled evtId
# 0 0.3 True True True 1 0
# 1 1.1 True True True 1 0
# 2 0.2 True True True 1 0
# 3 0.0 False True True 1 0
# 4 0.0 False True True 1 0
# 5 0.0 False True True 1 0
# 6 0.0 False True True 1 0
# 7 0.0 False True True 1 0
# 8 1.1 True True True 1 0
# 9 0.6 True True True 1 0
# 10 0.0 False True False 0 NaN
# 11 0.0 False True False 0 NaN
# 12 0.0 False False False 0 NaN
# 13 0.0 False True False 0 NaN
# 14 0.0 False True False 0 NaN
# 15 0.0 False True False 0 NaN
# 16 0.8 True True True 2 1
# 17 1.1 True True True 2 1
# 18 2.3 True True True 2 1
# 19 1.4 True True True 2 1
# 20 0.4 True True True 2 1
# 21 0.2 True True True 2 1
# 22 0.0 False True False 0 NaN
# 23 0.0 False True False 0 NaN
# 24 0.0 False False False 0 NaN
# 25 0.0 False False False 0 NaN
# 26 0.0 False False False 0 NaN
# 27 0.0 False False False 0 NaN
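Once evtId is in place, per-event statistics follow from an ordinary groupby, since pandas drops NaN group keys by default. The sketch below rebuilds the answer's data and computes, for each event, the number of rows, the total rainfall and the peak value; the variable name summary and the choice of statistics are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd
import scipy.ndimage as ndimage

df = pd.DataFrame({'RF': [0.3, 1.1, 0.2, 0., 0., 0., 0., 0.,
                          1.1, 0.6, 0., 0., 0., 0., 0., 0.,
                          0.8, 1.1, 2.3, 1.4, 0.4, 0.2, 0., 0.,
                          0., 0., 0., 0.]})

consecutive = 5
structure = [1] * (consecutive + 1)
mask = (df['RF'] > 0).to_numpy()
dilated = ndimage.binary_dilation(mask, structure=structure)
closed = ndimage.binary_erosion(dilated, structure=structure, border_value=1)
labeled, nobjs = ndimage.label(closed)
df['evtId'] = np.where(labeled > 0, labeled - 1, np.nan)

# Rows with evtId == NaN (outside any event) are excluded automatically.
summary = df.groupby('evtId')['RF'].agg(['count', 'sum', 'max'])
print(summary)
```

Note that evtId is a float column because it has to hold NaN, so the group keys appear as 0.0 and 1.0.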

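Note that dilation followed by erosion is called closing. The reason I used ndimage.binary_dilation and ndimage.binary_erosion instead of simply calling ndimage.binary_closing is that I needed to set border_value=1 to prevent the edges of the array from being eroded. Compare df['erosion'] with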

ndimage.binary_closing(mask, structure=[1]*(consecutive+1))

and you will see the difference.
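The comparison can be run directly. The sketch below rebuilds the answer's mask, computes both variants and prints the positions where they disagree; the variable names manual, closing and differs are chosen here for illustration. With the default border_value=0 used by binary_closing, the erosion treats everything outside the array as False, so an event that touches the start of the series loses its first rows.

```python
import numpy as np
import scipy.ndimage as ndimage

rf = np.array([0.3, 1.1, 0.2, 0., 0., 0., 0., 0.,
               1.1, 0.6, 0., 0., 0., 0., 0., 0.,
               0.8, 1.1, 2.3, 1.4, 0.4, 0.2, 0., 0.,
               0., 0., 0., 0.])
consecutive = 5
structure = [1] * (consecutive + 1)
mask = rf > 0

# The answer's approach: erosion sees True beyond the array edges.
dilated = ndimage.binary_dilation(mask, structure=structure)
manual = ndimage.binary_erosion(dilated, structure=structure, border_value=1)

# Plain closing: both steps use the default border_value=0.
closing = ndimage.binary_closing(mask, structure=structure)

differs = np.flatnonzero(manual != closing)
print(differs)                 # positions where the two results disagree
print(manual[0], closing[0])   # True False: row 0 is rainy but closing drops it
```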

Source: a similar question on Stack Overflow, "python - How can I use pandas to identify approximately (threshold-defined) consecutive non-null data?": https://stackoverflow.com/questions/32520993/
