
python - How to get reproducible but different GroupKFold instances

Reposted. Author: 太空狗. Updated: 2023-10-29 18:04:24

In the GroupKFold source code, random_state is set to None:

    def __init__(self, n_splits=3):
        super(GroupKFold, self).__init__(n_splits, shuffle=False,
                                         random_state=None)

As a result, running it several times (code from here)

import numpy as np
from sklearn.model_selection import GroupKFold

# Build the same splitter ten times to check whether the splits ever change.
for i in range(0, 10):
    X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
    y = np.array([1, 2, 3, 4])
    groups = np.array([0, 0, 2, 2])
    group_kfold = GroupKFold(n_splits=2)
    group_kfold.get_n_splits(X, y, groups)

    print(group_kfold)

    for train_index, test_index in group_kfold.split(X, y, groups):
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        print(X_train, X_test, y_train, y_test)
    # Two blank lines between repetitions.
    print()
    print()

gives (output from Python 2, where print with parentheses prints a tuple):

GroupKFold(n_splits=2)
('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
(array([[1, 2],
       [3, 4]]), array([[5, 6],
       [7, 8]]), array([1, 2]), array([3, 4]))
('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
(array([[5, 6],
       [7, 8]]), array([[1, 2],
       [3, 4]]), array([3, 4]), array([1, 2]))


GroupKFold(n_splits=2)
('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
(array([[1, 2],
       [3, 4]]), array([[5, 6],
       [7, 8]]), array([1, 2]), array([3, 4]))
('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
(array([[5, 6],
       [7, 8]]), array([[1, 2],
       [3, 4]]), array([3, 4]), array([1, 2]))

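and so on...

The splits are identical every time.

How can I set a random_state for GroupKFold so that I get a different (but reproducible) set of splits across several cross-validation trials?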

For example, I would like

GroupKFold(n_splits=2, random_state=42)
('TRAIN:', array([0, 1]),
'TEST:', array([2, 3]))

('TRAIN:', array([2, 3]),
'TEST:', array([0, 1]))


GroupKFold(n_splits=2, random_state=13)
('TRAIN:', array([0, 2]),
'TEST:', array([1, 3]))

('TRAIN:', array([1, 3]),
'TEST:', array([0, 2]))

So far, one strategy seems to be to apply sklearn.utils.shuffle first, as suggested in this post. However, that only rearranges the elements within each fold; it does not give us new splits.

from sklearn.utils import shuffle
from sklearn.model_selection import GroupKFold
import numpy as np
import sys

# The shuffle seed is passed on the command line, e.g. `python sshuf.py 32`.
random_state = int(sys.argv[1])


X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])

def cv(X, y, groups, random_state):
    # Shuffle the rows of X, y and groups together before splitting.
    X_s, y_s, groups_s = shuffle(X, y, groups, random_state=random_state)
    cv_out = GroupKFold(n_splits=2)
    cv_out_splits = cv_out.split(X_s, y_s, groups_s)
    for train, test in cv_out_splits:
        print("---")
        print(X_s[test])
        print(y_s[test])
        print("test groups", groups_s[test])
        print("train groups", groups_s[train])

print("***")
cv(X, y, groups, random_state)

Output:

>python sshuf.py 32

***
---
[[ 2  3]
 [ 4  5]
 [ 0  1]
 [ 8  9]
 [12 13]]
[1 2 0 4 6]
test groups [0 0 0 2 4]
train groups [7 6 1 3 5]
---
[[18 19]
 [16 17]
 [ 6  7]
 [10 11]
 [14 15]]
[9 8 3 5 7]
test groups [7 6 1 3 5]
train groups [0 0 0 2 4]

>python sshuf.py 234

***
---
[[12 13]
 [ 4  5]
 [ 0  1]
 [ 2  3]
 [ 8  9]]
[6 2 0 1 4]
test groups [4 0 0 0 2]
train groups [7 3 1 5 6]
---
[[18 19]
 [10 11]
 [ 6  7]
 [14 15]
 [16 17]]
[9 5 3 7 8]
test groups [7 3 1 5 6]
train groups [4 0 0 0 2]

Best answer

  • KFold is only randomized when shuffle=True. Some datasets should not be shuffled.
  • GroupKFold is not randomized at all, hence random_state=None.
  • GroupShuffleSplit may be closer to what you are looking for.
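
As a rough illustration of the last point, the sketch below runs GroupShuffleSplit on the toy data from the question with two different seeds. The helper name group_shuffle_splits and the choice test_size=0.5 are assumptions made for this example only. Note that, unlike GroupKFold, the test sets of successive splits are drawn independently and may overlap.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data taken from the question.
X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])


def group_shuffle_splits(random_state, n_splits=2, test_size=0.5):
    """Hypothetical helper: list of (train_groups, test_groups) sets for one seed."""
    gss = GroupShuffleSplit(n_splits=n_splits, test_size=test_size,
                            random_state=random_state)
    return [({int(g) for g in groups[train]}, {int(g) for g in groups[test]})
            for train, test in gss.split(X, y, groups)]


# The same seed reproduces the same splits; another seed may give different ones.
first = group_shuffle_splits(random_state=42)
again = group_shuffle_splits(random_state=42)
other = group_shuffle_splits(random_state=13)

for seed, splits in ((42, first), (13, other)):
    print("random_state =", seed)
    for train_groups, test_groups in splits:
        print("  train groups", sorted(train_groups),
              "test groups", sorted(test_groups))
```

Each split keeps every group entirely on one side, so the group constraint still holds; what is lost is the guarantee that the test sets together cover every sample exactly once.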

A comparison of the group-based splitters:

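  • GroupKFold: the test sets form a complete partition of all the data.
  • LeavePGroupsOut leaves out every possible subset of P groups, combinatorially; for P > 1 the test sets overlap. Since this means C(n_groups, P) splits in total (roughly n_groups ** P for small P), you usually want a small P, and most often you want LeaveOneGroupOut, which is essentially the same as GroupKFold with n_splits equal to the number of groups.
  • GroupShuffleSplit makes no statement about the relationship between successive test sets; each train/test split is performed independently.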
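
The comparison above can be checked with a short sketch that asks each splitter how many splits it produces on the question's toy data (8 distinct groups). The dictionary names and the parameter values (n_splits=3, test_size=0.25) are arbitrary choices for this example.

```python
from math import comb

import numpy as np
from sklearn.model_selection import (GroupKFold, GroupShuffleSplit,
                                     LeaveOneGroupOut, LeavePGroupsOut)

X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])
n_groups = len(np.unique(groups))  # 8 distinct groups

splitters = {
    "GroupKFold": GroupKFold(n_splits=2),
    "LeaveOneGroupOut": LeaveOneGroupOut(),
    "LeavePGroupsOut": LeavePGroupsOut(n_groups=2),
    "GroupShuffleSplit": GroupShuffleSplit(n_splits=3, test_size=0.25,
                                           random_state=0),
}

# Number of train/test splits each splitter generates for this data.
n_splits_by_name = {name: splitter.get_n_splits(X, y, groups)
                    for name, splitter in splitters.items()}
for name, n in n_splits_by_name.items():
    print(name, "->", n, "splits")

# GroupKFold test sets partition the data: count how often each sample is tested.
test_counts = np.zeros(len(X), dtype=int)
for _, test in splitters["GroupKFold"].split(X, y, groups):
    test_counts[test] += 1
print("times each sample is in a GroupKFold test set:", test_counts)
```

LeavePGroupsOut with P = 2 reports comb(8, 2) splits, which is why a small P is usually the practical choice.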

By the way, Dmytro Lituiev has proposed an alternative GroupShuffleSplit algorithm which, for a specified test_size, is better at getting the right number of samples (not merely the right number of groups) into the test set.
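
If a true partition into folds is still wanted, as with GroupKFold, but with the fold membership depending on a seed, one possible workaround is to run a shuffled KFold over the unique group labels and map the chosen groups back to sample indices. This is a minimal sketch and not the algorithm mentioned above; the function shuffled_group_kfold is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold


def shuffled_group_kfold(groups, n_splits, random_state):
    """Yield (train_idx, test_idx), assigning whole groups to folds at random.

    Hypothetical helper for illustration; not a scikit-learn API.
    """
    groups = np.asarray(groups)
    unique_groups = np.unique(groups)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    # Split the unique group labels, then expand each fold back to sample indices.
    for _, test_group_pos in kf.split(unique_groups):
        test_mask = np.isin(groups, unique_groups[test_group_pos])
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])
for seed in (42, 13):
    print("random_state =", seed)
    for train, test in shuffled_group_kfold(groups, n_splits=2, random_state=seed):
        print("  test groups", np.unique(groups[test]),
              "train groups", np.unique(groups[train]))
```

The trade-off is that folds are balanced by number of groups rather than by number of samples, so folds can be uneven when group sizes differ a lot.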

Regarding "python - How to get reproducible but different GroupKFold instances", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41859613/
