python - Combining RandomizedSearchCV (or GridSearchCV) with LeaveOneGroupOut cross-validation in scikit-learn

Reposted by: 行者123 | Updated: 2023-11-30 09:52:47

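I like using scikit-learn's LOGO (leave one group out) as a cross-validation method, together with learning curves (validation_curve in the code below). This works really well in most of the cases I deal with, but I can only (efficiently) tune the two parameters that, from experience, are the most critical in those cases: the maximum number of features and the number of estimators. An example of my code: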

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score, make_scorer
    from sklearn.model_selection import LeaveOneGroupOut, validation_curve

    # X, y and training_data are defined earlier (not shown in the question).
    Fscorer = make_scorer(f1_score, average='micro')  # micro-averaged F1
    gp = training_data["GP"].values  # group label of each sample
    logo = LeaveOneGroupOut()

    # One forest per n_estimators setting
    RF_clf100 = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=49)
    RF_clf200 = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=49)
    RF_clf300 = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=49)
    RF_clf400 = RandomForestClassifier(n_estimators=400, n_jobs=-1, random_state=49)
    RF_clf500 = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=49)
    RF_clf600 = RandomForestClassifier(n_estimators=600, n_jobs=-1, random_state=49)

    # Parameter swept by validation_curve
    param_name = "max_features"
    param_range = [5, 10, 15, 20, 25, 30]


    plt.figure()
    plt.suptitle('n_estimators = 100', fontsize=14, fontweight='bold')
    _, test_scores = validation_curve(RF_clf100, X, y, cv=logo.split(X, y, groups=gp),
                                      param_name=param_name, param_range=param_range,
                                      scoring=Fscorer, n_jobs=-1)
    test_scores_mean = np.mean(test_scores, axis=1)
    plt.plot(param_range, test_scores_mean, label="mean test F1")  # label gives plt.legend() an entry
    plt.xlabel(param_name)
    plt.xlim(min(param_range), max(param_range))
    plt.ylabel("F1")
    plt.ylim(0.47, 0.57)
    plt.legend(loc="best")
    plt.show()


    plt.figure()
    plt.suptitle('n_estimators = 200', fontsize=14, fontweight='bold')
    _, test_scores = validation_curve(RF_clf200, X, y, cv=logo.split(X, y, groups=gp),
                                      param_name=param_name, param_range=param_range,
                                      scoring=Fscorer, n_jobs=-1)
    test_scores_mean = np.mean(test_scores, axis=1)
    plt.plot(param_range, test_scores_mean, label="mean test F1")  # label gives plt.legend() an entry
    plt.xlabel(param_name)
    plt.xlim(min(param_range), max(param_range))
    plt.ylabel("F1")
    plt.ylim(0.47, 0.57)
    plt.legend(loc="best")
    plt.show()
    ...
    ...
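
The repeated blocks above could be folded into a loop. The following is a minimal sketch, not the asker's code: it assumes synthetic stand-in data from `make_classification`, four made-up groups, and deliberately small `n_estimators` and `max_features` values so it runs quickly. It passes the splitter object and the group labels to `validation_curve` separately (`cv=logo, groups=gp`).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import LeaveOneGroupOut, validation_curve

# Synthetic stand-in data (assumption): 120 samples, 8 features, 4 groups of 30.
X, y = make_classification(n_samples=120, n_features=8, random_state=0)
gp = np.repeat(np.arange(4), 30)

f_scorer = make_scorer(f1_score, average="micro")
logo = LeaveOneGroupOut()
param_range = [2, 4, 8]  # illustrative max_features values

# Mean LOGO test score per max_features value, keyed by n_estimators.
mean_scores = {}
for n_estimators in (10, 20):  # illustrative, much smaller than 100..600
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=49)
    _, test_scores = validation_curve(
        clf, X, y,
        param_name="max_features", param_range=param_range,
        groups=gp, cv=logo, scoring=f_scorer,
    )
    # test_scores has one row per parameter value and one column per fold.
    mean_scores[n_estimators] = test_scores.mean(axis=1)
```

Each entry of `mean_scores` can then be plotted against `param_range` in the same way as in the blocks above.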

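What I would really like is to combine LOGO with a grid search or a randomized search, for a more thorough search of the parameter space.

So far, my code looks like this: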

    from scipy.stats import randint as sp_randint
    from sklearn.model_selection import RandomizedSearchCV

    # Values and distributions to sample from (sp_randint excludes the upper bound)
    param_dist = {"n_estimators": [100, 200, 300, 400, 500, 600],
                  "max_features": sp_randint(5, 30),
                  "max_depth": sp_randint(2, 18),
                  "criterion": ['entropy', 'gini'],
                  "min_samples_leaf": sp_randint(2, 17)}

    clf = RandomForestClassifier(random_state=49)

    n_iter_search = 45
    random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                       n_iter=n_iter_search,
                                       scoring=Fscorer, cv=8,
                                       n_jobs=-1)
    random_search.fit(X, y)

When I replace cv=8 with cv=logo.split(X, y, groups=gp), I get the following error message:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-0092e11ffbf4> in <module>()
---> 35 random_search.fit(X, y)


/Applications/anaconda/lib/python2.7/site-packages/sklearn/model_selection/_search.pyc in fit(self, X, y, groups)
   1183                                           self.n_iter,
   1184                                           random_state=self.random_state)
-> 1185         return self._fit(X, y, groups, sampled_params)

/Applications/anaconda/lib/python2.7/site-packages/sklearn/model_selection/_search.pyc in _fit(self, X, y, groups, parameter_iterable)
    540 
    541         X, y, groups = indexable(X, y, groups)
--> 542         n_splits = cv.get_n_splits(X, y, groups)
    543         if self.verbose > 0 and isinstance(parameter_iterable, Sized):
    544             n_candidates = len(parameter_iterable)

/Applications/anaconda/lib/python2.7/site-packages/sklearn/model_selection/_split.pyc in get_n_splits(self, X, y, groups)
   1489             Returns the number of splitting iterations in the cross-validator.
   1490         """
-> 1491         return len(self.cv)  # Both iterables and old-cv objects support len
   1492 
   1493     def split(self, X=None, y=None, groups=None):

TypeError: object of type 'generator' has no len()

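Any suggestions on (1) what is happening and, more importantly, (2) how I can make this work (combining RandomizedSearchCV with LeaveOneGroupOut)?

*Update, 8 February 2017*

It works using cv=logo together with @Vivek Kumar's suggestion of random_search.fit(X, y, wells).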

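Best answer

You should not pass logo.split() into RandomizedSearchCV; pass only the cv object itself, such as logo. logo.split() returns a generator, and a generator has no length, which is why the len() call in the traceback raises a TypeError. RandomizedSearchCV calls split() internally to generate the train/test indices. You can pass the gp groups in the fit() call of the RandomizedSearchCV or GridSearchCV object.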
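The difference between the splitter object and the result of split() can be seen in a small sketch. The toy arrays below are assumptions made for illustration and are not the asker's data.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Toy data (assumption): 6 samples in 3 groups.
X = np.zeros((6, 2))
y = np.array([0, 1, 0, 1, 0, 1])
gp = np.array([1, 1, 2, 2, 3, 3])

logo = LeaveOneGroupOut()

# split() is a generator function, so its result has no len().
splits = logo.split(X, y, groups=gp)
try:
    len(splits)
    generator_has_len = True
except TypeError:
    generator_has_len = False

# The splitter object itself can report the number of splits from the groups.
n_splits = logo.get_n_splits(groups=gp)
```

With three distinct group labels, `n_splits` is 3, while `generator_has_len` stays `False`.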

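Instead of:

    random_search.fit(X, y)

do this (passing the groups by keyword, since newer scikit-learn releases do not accept them positionally):

    random_search.fit(X, y, groups=gp)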
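
Putting the pieces together, a minimal end-to-end sketch might look like the following. It assumes synthetic data from `make_classification`, four made-up groups, and a reduced parameter space and `n_iter` so it finishes quickly; none of these values come from the question.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import LeaveOneGroupOut, RandomizedSearchCV

# Synthetic stand-in data (assumption): 120 samples, 8 features, 4 groups of 30.
X, y = make_classification(n_samples=120, n_features=8, random_state=0)
gp = np.repeat(np.arange(4), 30)

# Reduced search space (assumption), in the same spirit as the question's param_dist.
param_dist = {
    "n_estimators": [10, 20],
    "max_features": randint(2, 8),
    "max_depth": randint(2, 6),
    "criterion": ["entropy", "gini"],
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=49),
    param_distributions=param_dist,
    n_iter=4,
    scoring=make_scorer(f1_score, average="micro"),
    cv=LeaveOneGroupOut(),  # pass the splitter object, not logo.split(...)
    random_state=0,
)

# The group labels go to fit(); the search forwards them to the splitter.
random_search.fit(X, y, groups=gp)
```

After fitting, `random_search.best_params_` holds the selected setting and `random_search.n_splits_` equals the number of distinct groups.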

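Edit: you can also pass gp as a dict through the fit_params parameter of the GridSearchCV or RandomizedSearchCV constructor. Note that this constructor argument was deprecated in scikit-learn 0.19 and removed in later releases, so passing the groups to fit() is the more durable option.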

Regarding "python - Combining RandomizedSearchCV (or GridSearchCV) with LeaveOneGroupOut cross-validation in scikit-learn", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41796301/
