
python - Imputation with Pandas

Reposted. Author: 行者123. Updated: 2023-11-28 22:42:33

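I have a multi-year time series at half-hourly resolution with some gaps, and I would like to impute them using the mean of the values from the other years at the same date and time. For example, if a value is missing at 2005-01-01 12:00, I want to take the values at January 1, 12:00 from all the other years, average them, and fill the missing value with that average. Here is what I have so far: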

import pandas as pd
import numpy as np

idx = pd.date_range('2000-1-1', '2010-1-1', freq='30min')  # '30T' is a deprecated alias in recent pandas
df = pd.DataFrame({'somedata': np.random.rand(175345)}, index=idx)
df.loc[df['somedata'] > 0.7, 'somedata'] = None

grouped = df.groupby([df.index.month, df.index.day, df.index.hour, df.index.minute]).mean()

This gives me the means I need, but I don't know how to insert them back into the original time series.

Best answer

You're almost there. Just use .transform to fill the NaNs.

import pandas as pd
import numpy as np

# your data
# ==================================================
np.random.seed(0)
idx = pd.date_range('2000-1-1', '2010-1-1', freq='30min')  # '30T' is a deprecated alias in recent pandas
df = pd.DataFrame({'somedata': np.random.rand(175345)}, index=idx)
df.loc[df['somedata'] > 0.7, 'somedata'] = np.nan


somedata
2000-01-01 00:00:00 0.5488
2000-01-01 00:30:00 NaN
2000-01-01 01:00:00 0.6028
2000-01-01 01:30:00 0.5449
2000-01-01 02:00:00 0.4237
2000-01-01 02:30:00 0.6459
2000-01-01 03:00:00 0.4376
2000-01-01 03:30:00 NaN
... ...
2009-12-31 20:30:00 0.4983
2009-12-31 21:00:00 0.4282
2009-12-31 21:30:00 NaN
2009-12-31 22:00:00 0.3306
2009-12-31 22:30:00 0.3021
2009-12-31 23:00:00 0.2077
2009-12-31 23:30:00 0.2965
2010-01-01 00:00:00 0.5183

[175345 rows x 1 columns]

# processing
# ==================================================
result = df.groupby([df.index.month, df.index.day, df.index.hour, df.index.minute], as_index=False).transform(lambda g: g.fillna(g.mean()))

somedata
2000-01-01 00:00:00 0.5488
2000-01-01 00:30:00 0.2671
2000-01-01 01:00:00 0.6028
2000-01-01 01:30:00 0.5449
2000-01-01 02:00:00 0.4237
2000-01-01 02:30:00 0.6459
2000-01-01 03:00:00 0.4376
2000-01-01 03:30:00 0.3957
... ...
2009-12-31 20:30:00 0.4983
2009-12-31 21:00:00 0.4282
2009-12-31 21:30:00 0.4784
2009-12-31 22:00:00 0.3306
2009-12-31 22:30:00 0.3021
2009-12-31 23:00:00 0.2077
2009-12-31 23:30:00 0.2965
2010-01-01 00:00:00 0.5183

[175345 rows x 1 columns]
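
As a possible alternative to the lambda in the processing step above, the per-group mean can be computed with the built-in 'mean' transform and then handed to fillna. This is only a sketch: the names keys, group_means and filled are introduced here for illustration and are not part of the original answer. It rebuilds the same data so that it runs on its own. With roughly 17,500 groups it may avoid the overhead of one Python call per group, though that has not been benchmarked here.

```python
import numpy as np
import pandas as pd

# Rebuild the data from the answer so this block is self-contained
np.random.seed(0)
idx = pd.date_range('2000-1-1', '2010-1-1', freq='30min')
df = pd.DataFrame({'somedata': np.random.rand(len(idx))}, index=idx)
df.loc[df['somedata'] > 0.7, 'somedata'] = np.nan

# Same grouping keys as in the answer: month, day, hour, minute
keys = [df.index.month, df.index.day, df.index.hour, df.index.minute]

# Per-group mean, broadcast back onto the original index (NaNs are skipped)
group_means = df.groupby(keys).transform('mean')

# fillna aligns on index and column, so only the missing entries are replaced
filled = df.fillna(group_means)
```

Because fillna only touches missing entries, the observed values in filled are identical to those in df.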

# take a look at a particular sample
# ======================================
x = list(df.groupby([df.index.month, df.index.day, df.index.hour, df.index.minute]))[0][1]

somedata
2000-01-01 0.5488
2001-01-01 0.1637
2002-01-01 0.3245
2003-01-01 NaN
2004-01-01 0.5654
2005-01-01 0.5729
2006-01-01 0.4740
2007-01-01 0.1728
2008-01-01 0.2577
2009-01-01 NaN
2010-01-01 0.5183

x.mean() # output: 0.3998

list(result.groupby([df.index.month, df.index.day, df.index.hour, df.index.minute]))[0][1]

somedata
2000-01-01 0.5488
2001-01-01 0.1637
2002-01-01 0.3245
2003-01-01 0.3998
2004-01-01 0.5654
2005-01-01 0.5729
2006-01-01 0.4740
2007-01-01 0.1728
2008-01-01 0.2577
2009-01-01 0.3998
2010-01-01 0.5183
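
One case the spot check above does not show is a group in which every year is missing: its mean is NaN, so the gaps stay unfilled. The sketch below illustrates both cases on a tiny hand-made frame. The values and the names toy, filled and still_missing are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three years, two fixed times of day on March 1
idx = pd.to_datetime([
    '2001-03-01 12:00', '2002-03-01 12:00', '2003-03-01 12:00',
    '2001-03-01 12:30', '2002-03-01 12:30', '2003-03-01 12:30',
])
toy = pd.DataFrame(
    {'somedata': [1.0, np.nan, 3.0, np.nan, np.nan, np.nan]}, index=idx
)

# Same grouping and fill as in the answer
keys = [toy.index.month, toy.index.day, toy.index.hour, toy.index.minute]
filled = toy.groupby(keys).transform(lambda g: g.fillna(g.mean()))

# 12:00 group: the 2002 gap becomes the mean of 1.0 and 3.0
print(filled.loc[pd.Timestamp('2002-03-01 12:00'), 'somedata'])  # 2.0

# 12:30 group: no year has a value, so the mean is NaN and the gaps remain
still_missing = filled['somedata'].isna()
print(still_missing.sum())  # 3
```

With real data this can matter for February 29, which only occurs in leap years and so has fewer values to average; checking result.isna().sum() after the fill shows whether any gaps are left.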

Regarding python - Imputation with Pandas, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/31542064/
