
Python scikit-learn: Cross Validation with multi-index

Reposted. Author: 太空宇宙. Updated: 2023-11-04 02:11:50

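Hello, I want to use one of scikit-learn's functions for cross-validation. What I want is for the fold splits to be determined by one of the indexes. For example, suppose I have this data, where "Month" and "Day" are the indexes: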

Month     Day  Feature_1
January   1    10
          2    20
February  1    30
          2    40
March     1    50
          2    60
          3    70
April     1    80
          2    90

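Suppose I want to hold out 1/4 of the data as the test set for each validation round. I want this fold separation to be done by the first index, i.e. the month. In this case, the test set would be one of the months and the remaining 3 months would be the training set. For example, one of the train/test splits would look like this: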

TEST SET:
Month     Day  Feature_1
January   1    10
          2    20

TRAINING SET:
Month     Day  Feature_1
February  1    30
          2    40
March     1    50
          2    60
          3    70
April     1    80
          2    90

How can I do this? Thanks.

Best answer

This is called splitting by group. See the scikit-learn user guide to understand more about it:

...

To measure this, we need to ensure that all the samples in the validation fold come from groups that are not represented at all in the paired training fold.

...

You can use GroupKFold, or one of the other strategies with Group in their name. An example could be:

from sklearn.model_selection import GroupKFold

# reset_index() moves the index levels ("Month", "Day") into regular columns
df = df.reset_index()

print(df)
# Month     Day  Feature_1
# January   1    10
# January   2    20
# February  1    30
# February  2    40
# March     1    50
# March     2    60
# March     3    70

groups = df['Month']

gkf = GroupKFold(n_splits=3)
# y is not needed to compute the split, so only the data and the groups are passed
for train, test in gkf.split(df, groups=groups):
    # "train" and "test" are positional indices,
    # so use "iloc" to get the actual rows
    print("%s %s" % (train, test))

    print(df.iloc[train, :])
    print(df.iloc[test, :])
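
The snippet above uses n_splits=3, while the question asks for one month per test fold out of four. A minimal self-contained sketch of that case follows. It rebuilds the question's data with a (Month, Day) MultiIndex and reads the groups from the index level instead of calling reset_index(). The variable names (index, folds, train_idx, test_idx) are illustrative and not part of the original answer.

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# The question's data, rebuilt with a (Month, Day) MultiIndex.
index = pd.MultiIndex.from_tuples(
    [("January", 1), ("January", 2),
     ("February", 1), ("February", 2),
     ("March", 1), ("March", 2), ("March", 3),
     ("April", 1), ("April", 2)],
    names=["Month", "Day"],
)
df = pd.DataFrame({"Feature_1": [10, 20, 30, 40, 50, 60, 70, 80, 90]}, index=index)

# Take the group labels straight from the "Month" index level.
groups = df.index.get_level_values("Month").to_numpy()

# Four distinct months and four splits, so each test fold holds one month.
gkf = GroupKFold(n_splits=4)
folds = []
for train_idx, test_idx in gkf.split(df, groups=groups):
    train_months = sorted(set(groups[train_idx]))
    test_months = sorted(set(groups[test_idx]))
    folds.append((train_months, test_months))
    print("test:", test_months, "train:", train_months)
```

If the number of months is not known in advance, LeaveOneGroupOut from the same module holds out one group per fold without an n_splits argument.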

Update: To use this with a cross-validation function, just pass the month data to that function's groups parameter, as shown below:

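from sklearn.model_selection import GroupKFold, cross_val_predict

gkf = GroupKFold(n_splits=3)
# groups needs one label per row of X_train
y_pred = cross_val_predict(estimator, X_train, y_train, cv=gkf, groups=df['Month'])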
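
The call above leaves estimator, X_train and y_train undefined. Here is a runnable sketch of the same idea under stated assumptions: the Target column is invented purely for illustration (the question gives no target), LinearRegression stands in for whatever estimator is actually used, and n_splits=4 is chosen so that each month is predicted by a model trained on the other three.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, cross_val_predict

# The question's data with the index levels already flattened into columns.
df = pd.DataFrame({
    "Month": ["January", "January", "February", "February",
              "March", "March", "March", "April", "April"],
    "Day": [1, 2, 1, 2, 1, 2, 3, 1, 2],
    "Feature_1": [10, 20, 30, 40, 50, 60, 70, 80, 90],
})

# Hypothetical target, invented only so the estimator has something to fit.
df["Target"] = 2.0 * df["Feature_1"] + 1.0

X = df[["Day", "Feature_1"]]
y = df["Target"]

# groups must have one label per row of X.
gkf = GroupKFold(n_splits=4)
y_pred = cross_val_predict(LinearRegression(), X, y, cv=gkf, groups=df["Month"])

# One out-of-fold prediction per row.
print(y_pred.shape)
```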

Regarding Python scikit-learn: Cross Validation with multi-index, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53591919/
