
python - Backfilling missing data with labels using a time window

Reposted. Author: 太空宇宙. Updated: 2023-11-04 04:47:35

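I want to backfill each column and tag every filled value with a label showing how far back it was filled (1 day, 2 days). Here is the code: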

from datetime import datetime, timedelta
import pandas as pd
import numpy as np
import random
np.random.seed(11)


date_today = datetime.now()
ndays = 15
df = pd.DataFrame({'date': [date_today + timedelta(days=x) for x in range(ndays)],
                   'test': pd.Series(np.random.randn(ndays)),
                   'test2': pd.Series(np.random.randn(ndays))})

df = df.set_index('date')
df = df.mask(np.random.random(df.shape) < .7)
print(df) # this will be the dataset that I generate for this question

# my original dataset has labels, which is why I convert it to str
df['test'] = df['test'].astype(str)
df['test2'] = df['test2'].astype(str)
df.replace('nan', np.nan, inplace=True)

# my attempt: backfill inside the 24-hour window before each fully populated row
for I in df.dropna().index.values:
    end = I
    start = end - np.timedelta64(24, 'h')
    start2 = end - np.timedelta64(48, 'h')  # 48-hour window, not used yet
    window = (df.index >= start) & (df.index <= end)
    df[window] = df[window].bfill()

My initial dataset looks like this:

                                test     test2
date
2018-03-07 11:28:23.028856 NaN NaN
2018-03-08 11:28:23.028856 NaN NaN
2018-03-09 11:28:23.028856 -0.484565 1.574634
2018-03-10 11:28:23.028856 -2.653319 NaN
2018-03-11 11:28:23.028856 NaN NaN
2018-03-12 11:28:23.028856 NaN NaN
2018-03-13 11:28:23.028856 -0.536629 NaN
2018-03-14 11:28:23.028856 NaN 0.725752
2018-03-15 11:28:23.028856 NaN 1.549072
2018-03-16 11:28:23.028856 -1.065603 0.630080
2018-03-17 11:28:23.028856 NaN NaN
2018-03-18 11:28:23.028856 -0.475733 0.732271
2018-03-19 11:28:23.028856 NaN -0.642575
2018-03-20 11:28:23.028856 NaN -0.178093
2018-03-21 11:28:23.028856 NaN -0.573955

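What I want to get is shown below. I tried different approaches, but I could not find a way to do it with bfill: bfill takes no value argument, and fillna accepts either a method or a value, not both.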

                                test     test2
date
2018-03-07 11:28:23.028856 -0.484565_2D 1.574634_2D
2018-03-08 11:28:23.028856 -0.484565_D 1.574634_D
2018-03-09 11:28:23.028856 -0.484565 1.574634
2018-03-10 11:28:23.028856 -2.653319 NaN
2018-03-11 11:28:23.028856 -0.536629_2D NaN
2018-03-12 11:28:23.028856 -0.536629_D 0.725752_2D
2018-03-13 11:28:23.028856 -0.536629 0.725752_D
2018-03-14 11:28:23.028856 -1.065603_2D 0.725752
2018-03-15 11:28:23.028856 -1.065603_D 1.549072
2018-03-16 11:28:23.028856 -1.065603 0.630080
2018-03-17 11:28:23.028856 -0.475733_D 0.732271_D
2018-03-18 11:28:23.028856 -0.475733 0.732271
2018-03-19 11:28:23.028856 NaN -0.642575
2018-03-20 11:28:23.028856 NaN -0.178093
2018-03-21 11:28:23.028856 NaN -0.573955

Update: the timestamps in my original dataset are not uniform, so this code creates similar timestamps:

date_today = datetime.now()
ndays = 15
df = pd.DataFrame({'date': [date_today + timedelta(days=(abs(np.random.randn(1))*2)[0]*x) for x in range(ndays)],
                   'test': pd.Series(np.random.randn(ndays)),
                   'test2': pd.Series(np.random.randn(ndays))})

df1 = pd.DataFrame({'date': [date_today + timedelta(hours=x) for x in range(ndays)],
                    'test': pd.Series(np.random.randn(ndays)),
                    'test2': pd.Series(np.random.randn(ndays))})
df2 = pd.DataFrame({'date': [date_today + timedelta(days=x) - timedelta(seconds=100*x) for x in range(ndays)],
                    'test': pd.Series(np.random.randn(ndays)),
                    'test2': pd.Series(np.random.randn(ndays))})
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames the same way
df = pd.concat([df, df1, df2])


df = df.set_index('date')
df = df.mask(np.random.random(df.shape) < .7)
print(df) # this will be the dataset that I generate for this question

# my original dataset has labels, which is why I convert it to str
df['test'] = df['test'].astype(str)
df['test2'] = df['test2'].astype(str)
df.replace('nan', np.nan, inplace=True)

I would really appreciate it if someone could help.

Thanks in advance.

Best answer

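Create a filled DataFrame by backfilling with a limit of 2 (fillna with method='bfill' and limit=2, or the equivalent bfill call)

# equivalent to df.fillna(method='bfill', limit=2); the method argument is deprecated since pandas 2.1
filled = df.bfill(limit=2)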
# filled outputs:
test test2
date
2018-03-07 16:12:25.944362 -0.484565132221 1.5746340731
2018-03-08 16:12:25.944362 -0.484565132221 1.5746340731
2018-03-09 16:12:25.944362 -0.484565132221 1.5746340731
2018-03-10 16:12:25.944362 -2.65331855926 NaN
2018-03-11 16:12:25.944362 -0.536629362235 NaN
2018-03-12 16:12:25.944362 -0.536629362235 0.725752224799
2018-03-13 16:12:25.944362 -0.536629362235 0.725752224799
2018-03-14 16:12:25.944362 -1.06560298045 0.725752224799
2018-03-15 16:12:25.944362 -1.06560298045 1.54907163337
2018-03-16 16:12:25.944362 -1.06560298045 0.630079822493
2018-03-17 16:12:25.944362 -0.475733492683 0.732271353885
2018-03-18 16:12:25.944362 -0.475733492683 0.732271353885
2018-03-19 16:12:25.944362 NaN -0.642575392433
2018-03-20 16:12:25.944362 NaN -0.178093175312
2018-03-21 16:12:25.944362 NaN -0.57395455941
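
One caveat worth checking on your own data: limit=2 counts consecutive rows, not elapsed time, so it matches a 2-day window only while the index has exactly one row per day. A minimal sketch on a made-up Series (the values are illustrative, not taken from the question) shows which cells of a longer gap get filled:

```python
import numpy as np
import pandas as pd

# Hypothetical data: a gap of three NaNs, then a gap of one NaN.
s = pd.Series([np.nan, np.nan, np.nan, 1.0, np.nan, 2.0])

# Only the two NaNs closest to the next valid value are filled;
# the first row is three rows away from 1.0, so it stays NaN.
filled = s.bfill(limit=2)
print(filled.tolist())  # [nan, 1.0, 1.0, 1.0, 2.0, 2.0]
```

With irregular timestamps, two rows back could mean two hours or two weeks, so the row-based limit would need to be replaced by a comparison on the timestamps themselves.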

Create a boolean DataFrame indicating whether each cell was filled

is_filled = df.isnull() & filled.notnull()
# is_filled outputs:
test test2
date
2018-03-07 16:12:25.944362 True True
2018-03-08 16:12:25.944362 True True
2018-03-09 16:12:25.944362 False False
2018-03-10 16:12:25.944362 False False
2018-03-11 16:12:25.944362 True False
2018-03-12 16:12:25.944362 True True
2018-03-13 16:12:25.944362 False True
2018-03-14 16:12:25.944362 True False
2018-03-15 16:12:25.944362 True False
2018-03-16 16:12:25.944362 False False
2018-03-17 16:12:25.944362 True True
2018-03-18 16:12:25.944362 False False
2018-03-19 16:12:25.944362 False False
2018-03-20 16:12:25.944362 False False
2018-03-21 16:12:25.944362 False False

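Create masks indicating which filled values need the suffix _1D or _2D

# True where the next row was also filled; fill_value keeps the dtype boolean so ~ and & behave
next_filled = is_filled.shift(-1, fill_value=False)
# filled and followed by an original value: one row back (_1D)
# applymap is deprecated since pandas 2.1, where DataFrame.map replaces it
one_d = (is_filled & ~next_filled).applymap(lambda x: '_1D' if x else '')
# filled and followed by another filled cell: two rows back (_2D)
two_d = (is_filled & next_filled).applymap(lambda x: '_2D' if x else '')
# stack both masks and keep the non-empty suffix for each date
suffix = pd.concat([one_d, two_d]).groupby('date').agg('max')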
# suffix outputs:

                           test test2
date
2018-03-07 16:12:25.944362  _2D   _2D
2018-03-08 16:12:25.944362  _1D   _1D
2018-03-09 16:12:25.944362
2018-03-10 16:12:25.944362
2018-03-11 16:12:25.944362  _2D
2018-03-12 16:12:25.944362  _1D   _2D
2018-03-13 16:12:25.944362        _1D
2018-03-14 16:12:25.944362  _2D
2018-03-15 16:12:25.944362  _1D
2018-03-16 16:12:25.944362
2018-03-17 16:12:25.944362  _1D   _1D
2018-03-18 16:12:25.944362
2018-03-19 16:12:25.944362
2018-03-20 16:12:25.944362
2018-03-21 16:12:25.944362
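
The two masks above are specific to limit=2. If a larger limit is ever needed, one possible generalization is to compute, for each cell, how many rows away the next valid value is, and build the suffix from that number. The sketch below is an assumption-laden illustration: rows_to_next_valid is a hypothetical helper name, and the sample Series is made up.

```python
import numpy as np
import pandas as pd

def rows_to_next_valid(s: pd.Series) -> pd.Series:
    """Row distance from each cell to the next non-null value (0 for valid cells)."""
    valid = s.notna().to_numpy()
    # Count the valid values at or after each row, so a valid value and the
    # NaN rows directly before it end up with the same group id.
    grp = valid[::-1].cumsum()[::-1]
    # Count down inside each group: the valid row itself gets 0.
    dist = pd.Series(valid, index=s.index).groupby(grp).cumcount(ascending=False)
    # Trailing NaNs (group id 0) have no later value to be filled from.
    return dist.where(grp > 0)

s = pd.Series([np.nan, np.nan, 'a', np.nan, 'b', np.nan])
dist = rows_to_next_valid(s)

limit = 2
# Cells within the limit get a suffix such as _1D or _2D; all others get ''.
suffix = dist.map(lambda d: f'_{int(d)}D' if 0 < d <= limit else '')
print(suffix.tolist())  # ['_2D', '_1D', '', '_1D', '', '']
```

Like the masks in the answer, this counts rows, so the D in the suffix only means days when the index has one row per day.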

Join the suffix DataFrame to the filled DataFrame, convert the floats to strings, and append the appropriate suffix

final = filled.join(suffix, rsuffix='_x')
final.apply(lambda x: '{}{}'.format(x.test, x.test_x), axis=1)
# outputs:
date
2018-03-07 16:12:25.944362 -0.484565132221_2D
2018-03-08 16:12:25.944362 -0.484565132221_1D
2018-03-09 16:12:25.944362 -0.484565132221
2018-03-10 16:12:25.944362 -2.65331855926
2018-03-11 16:12:25.944362 -0.536629362235_2D
2018-03-12 16:12:25.944362 -0.536629362235_1D
2018-03-13 16:12:25.944362 -0.536629362235
2018-03-14 16:12:25.944362 -1.06560298045_2D
2018-03-15 16:12:25.944362 -1.06560298045_1D
2018-03-16 16:12:25.944362 -1.06560298045
2018-03-17 16:12:25.944362 -0.475733492683_1D
2018-03-18 16:12:25.944362 -0.475733492683
2018-03-19 16:12:25.944362 nan
2018-03-20 16:12:25.944362 nan
2018-03-21 16:12:25.944362 nan
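
The join plus row-wise apply handles one column per call. As a possible alternative, the suffixes can be concatenated onto every column in one step. This is a minimal sketch on made-up string labels (the small frame below is hypothetical), using np.where in place of the two applymap masks:

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2018-03-07', periods=6, freq='D', name='date')
df = pd.DataFrame({'test':  [np.nan, np.nan, 'a', np.nan, 'b', np.nan],
                   'test2': [np.nan, 'x', np.nan, np.nan, np.nan, 'y']},
                  index=idx)

filled = df.bfill(limit=2)
is_filled = df.isna() & filled.notna()
next_filled = is_filled.shift(-1, fill_value=False)

# '_2D' when the next row was filled too, '_1D' for any other filled cell.
suffix = pd.DataFrame(
    np.where(is_filled & next_filled, '_2D', np.where(is_filled, '_1D', '')),
    index=df.index, columns=df.columns)

# Element-wise string concatenation; cells that stayed empty remain NaN.
labeled = (filled.astype(str) + suffix).where(filled.notna())
print(labeled)
```

In this sample the test2 column has a gap of three rows, so its first missing cell is left as NaN while the other two become y_2D and y_1D.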

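Similarly, you can generate the filled and suffixed series for test2. However, I suggest keeping test and test2 as numeric types and storing the fill and lag information in separate columns (here, the suffix columns of the DataFrame final hold that information).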
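
A sketch of that suggestion, extended to the non-uniform timestamps mentioned in the question's update: the values stay numeric, and the elapsed time to the observation each row was filled from goes into its own column. This is not part of the original answer; backfill_within, the column names value, lag and label, and the sample data are all assumptions made for illustration, and the index is assumed to be timezone-naive.

```python
import numpy as np
import pandas as pd

def backfill_within(s: pd.Series, max_lag: pd.Timedelta) -> pd.DataFrame:
    """Backfill s only from observations at most max_lag later in time."""
    s = s.sort_index()
    stamps = pd.Series(s.index, index=s.index)
    # Timestamp of the next observed value for every row (NaT if none follows).
    next_obs = stamps.where(s.notna().to_numpy()).bfill()
    # Elapsed time to that observation; zero for rows that were observed.
    lag = pd.Series(next_obs.to_numpy() - s.index.to_numpy(), index=s.index)
    within = lag <= max_lag  # comparisons with NaT are False
    return pd.DataFrame({'value': s.bfill().where(within),
                         'lag': lag.where(within)})

idx = pd.to_datetime(['2018-03-07 00:00', '2018-03-08 12:00', '2018-03-09 00:00',
                      '2018-03-10 00:00', '2018-03-14 00:00'])
s = pd.Series([np.nan, np.nan, 1.0, np.nan, 2.0], index=idx, name='test')

out = backfill_within(s, pd.Timedelta('2D'))

# Optional text label derived from the lag, kept apart from the numeric value.
one_day = pd.Timedelta('1D')
out['label'] = np.select(
    [(out['lag'] > pd.Timedelta(0)) & (out['lag'] <= one_day), out['lag'] > one_day],
    ['_1D', '_2D'], default='')
print(out)
```

Here the row at 2018-03-10 stays empty because the next observation is four days away, whereas a row-based limit of 2 would have filled it.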

This post is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/49158268/
