python - Pandas groupby train/validation split

Reposted. Author: 行者123. Last updated: 2023-12-04 15:28:20

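I have a daily temperature dataset and I'm trying to build a model that processes one week of data at a time. I've loaded it into a pandas DataFrame and grouped it by week (using the resample method). So far so good.

Note that I don't want to aggregate the weekly data. I only want to group my "flat" dataset into weekly "blocks" so that I can feed them to the model one block at a time.

I can do that with the code below, but my question is:

How do I split this grouped DataFrame into training/validation sets?

Here is what I've tried so far (most of which fails):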

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

daily = pd.DataFrame(
    data=np.random.rand(365) * 120, columns=["temp"],
    index=pd.date_range(start="2019-01-01", end="2019-12-31", freq="D")
)
print("days:", len(daily))

weekly = daily.resample("W")
print("weeks:", len(weekly))

mask = np.random.rand(len(weekly)) < .8
# Both of these give KeyError: 'Columns not found: False, True'
train = weekly[mask]
valid = weekly[~mask]

# This also fails with KeyError: 'Columns not found: 12'
train, valid = train_test_split(weekly, train_size=.8)

Update:

In the meantime, I came up with a pair of generators that can be used for training/validation:

def gen_train(df, mask):
    # Iterating a Resampler yields (week_end, week_dataframe) pairs
    for index, (_, data) in enumerate(df):
        if mask[index]:
            yield data

def gen_valid(df, mask):
    for index, (_, data) in enumerate(df):
        if not mask[index]:
            yield data

mask = np.random.rand(len(weekly)) < .8

model.fit(x=gen_train(weekly, mask), validation_data=gen_valid(weekly, mask),
    ...
)

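Unfortunately, this doesn't shuffle the data.

Can anyone come up with a better solution?

Best answer

Your problem is that you haven't completed the resample call. resample("W") on its own only returns a lazy Resampler object, and indexing that object selects columns, which is why you get a KeyError. Choose an aggregation for the resample and your code will work: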

...
weekly = daily.resample("W").mean() # <- Note the call to complete the resample with weekly mean
train, valid = train_test_split(weekly, train_size=.8)

train.shape
# (42, 1)

valid.shape
# (11, 1)

42 / (42 + 11)
# 0.7924528301886793

Edit: if you don't want to resample, just loop over the weeks with groupby:

...
for date, week in daily.groupby(pd.Grouper(freq='W')):
    # Splits the days inside each week, not the weeks themselves
    train, valid = train_test_split(week, train_size=.8)
    print(date)
    print(train.shape)
    print(valid.shape)

2019-01-06 00:00:00
(4, 1)
(2, 1)
2019-01-13 00:00:00
(5, 1)
(2, 1)
2019-01-20 00:00:00
(5, 1)
(2, 1)
2019-01-27 00:00:00
(5, 1)
(2, 1)
2019-02-03 00:00:00
(5, 1)
(2, 1)
...
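
The loop above splits the days inside each week. If the goal is instead to keep every week intact and assign whole weeks to either set, one possible sketch is to collect the groups into a list and split the list positions. The seed, the random_state value and the names weeks, train_weeks and valid_weeks are assumptions made for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is repeatable
daily = pd.DataFrame(
    data=rng.random(365) * 120, columns=["temp"],
    index=pd.date_range(start="2019-01-01", end="2019-12-31", freq="D"),
)

# One un-aggregated DataFrame per week (weeks end on Sunday)
weeks = [week for _, week in daily.groupby(pd.Grouper(freq="W"))]

# Split the week positions; train_test_split shuffles by default
train_idx, valid_idx = train_test_split(
    np.arange(len(weeks)), train_size=0.8, random_state=0
)
train_weeks = [weeks[i] for i in train_idx]
valid_weeks = [weeks[i] for i in valid_idx]

print(len(weeks), len(train_weeks), len(valid_weeks))
```

Each element of train_weeks and valid_weeks is still a small DataFrame of daily rows. The first and last week of 2019 are partial, so they hold fewer than seven days.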

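Edit: if you want to sample weeks as the unit of observation, you need to create a new column for them:

# DatetimeIndex.week was removed in pandas 2.0; isocalendar().week returns the same ISO week numbers.
# Caveat: 2019-12-30 and 2019-12-31 fall in ISO week 1, so they share the label '2019-1' with early January.
daily['week'] = daily.index.year.astype(str) + '-' + daily.index.isocalendar().week.astype(str).to_numpy()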

                  temp    week
2019-01-01   98.551345  2019-1
2019-01-02  103.880149  2019-1
2019-01-03   48.187819  2019-1
2019-01-04  116.942540  2019-1
2019-01-05   21.342152  2019-1
...                ...     ...

Then train/test split the weeks and select the rows:

train_weeks, test_weeks = train_test_split(daily.week.unique(), train_size=.8)
train = daily[daily.week.isin(train_weeks)]
test = daily[daily.week.isin(test_weeks)]

train.shape
#(288, 2)

test.shape
#(77, 2)
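
To feed the selected rows to a model one week at a time, and in a different order on each pass, something along these lines might work. It is only a sketch: iter_weeks is a hypothetical helper, and the week label here is built with to_period("W") (Monday to Sunday periods) as an assumption, which differs from the year-week string used above.

```python
import numpy as np
import pandas as pd

def iter_weeks(df, key="week", rng=None):
    """Yield one DataFrame per distinct value of `key`, in random order."""
    rng = np.random.default_rng() if rng is None else rng
    blocks = [block for _, block in df.groupby(key, sort=False)]
    for i in rng.permutation(len(blocks)):
        yield blocks[i]

# Demo data; the seed is fixed only so the sketch is repeatable
data_rng = np.random.default_rng(0)
daily = pd.DataFrame(
    data=data_rng.random(365) * 120, columns=["temp"],
    index=pd.date_range(start="2019-01-01", end="2019-12-31", freq="D"),
)
# Assumed labeling: one string per Monday-to-Sunday period
daily["week"] = daily.index.to_period("W").astype(str)

blocks = list(iter_weeks(daily, rng=np.random.default_rng(1)))
print(len(blocks), sum(len(b) for b in blocks))
```

Calling iter_weeks again produces a new permutation, so each epoch can see the weeks in a different order while the days inside a week stay in their original sequence. The same helper could be pointed at the train and test frames from the split above, since both keep a week column.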

Regarding "python - Pandas groupby train/validation split", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/61804274/
