
python - When should I shuffle in StratifiedKFold

Reposted. Author: 行者123. Updated: 2023-11-30 08:47:52

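I have read several threads about the various cross-validation (CV) methods. What I don't understand is why shuffling the data inside the function leads to a significantly higher accuracy, and when it is correct to do so.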

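My time series dataset has size 921 × 10080. Each row is the water-temperature time series of a specific location in an area, and the last 2 columns are the labels, which fall into 2 groups: high risk (high bacteria level in the water) and low risk (low bacteria level in the water). The accuracy varies greatly depending on the shuffle setting in StratifiedKFold: around 75% with "shuffle=True" versus 50% with "shuffle=False". The setup is shown below: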

from sklearn.model_selection import StratifiedKFold

n_folds = 5
skf = StratifiedKFold(n_splits=n_folds, shuffle=True)

The sklearn documentation states the following:

A note on shuffling

If the data ordering is not arbitrary (e.g. samples with the same class label are contiguous), shuffling it first may be essential to get a meaningful cross-validation result. However, the opposite may be true if the samples are not independently and identically distributed. For example, if samples correspond to news articles, and are ordered by their time of publication, then shuffling the data will likely lead to a model that is overfit and an inflated validation score: it will be tested on samples that are artificially similar (close in time) to training samples.

Some cross validation iterators, such as KFold, have an inbuilt option to shuffle the data indices before splitting them. Note that:

• This consumes less memory than shuffling the data directly.

• By default no shuffling occurs, including for the (stratified) K fold cross-validation performed by specifying cv=some_integer to cross_val_score, grid search, etc. Keep in mind that train_test_split still returns a random split.

• The random_state parameter defaults to None, meaning that the shuffling will be different every time KFold(..., shuffle=True) is iterated. However, GridSearchCV will use the same shuffling for each set of parameters validated by a single call to its fit method.

• To get identical results for each split, set random_state to an integer.

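I'm not sure whether I'm interpreting the documentation correctly, so an explanation would be much appreciated. I also have a few questions:

1) Why does the accuracy improve so much after shuffling? Am I overfitting? When should I shuffle?

2) Given that all the samples were collected in the same area, they are probably not independent. How does this affect shuffling? Is shuffling still valid?

3) Does shuffling separate the labels from their corresponding X data? (Update from the answer: no. Shuffling does not separate the labels from their corresponding X data.)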
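The answer to question 3 can be illustrated with a small sketch. It assumes hypothetical toy data in which the first feature of each row stores that row's own label, so any mismatch between X and y after splitting would be detectable. StratifiedKFold only yields index arrays, and the same indices are applied to both X and y.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: column 0 of X is a copy of the label, column 1 is a row id.
y = np.array([0] * 10 + [1] * 10)
X = np.column_stack([y, np.arange(20)])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

pairs_intact = True
for train_idx, test_idx in skf.split(X, y):
    # The same index array selects rows of X and entries of y.
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    # If shuffling had broken the pairing, column 0 would no longer match y.
    pairs_intact = pairs_intact and np.array_equal(X_train[:, 0], y_train)
    pairs_intact = pairs_intact and np.array_equal(X_test[:, 0], y_test)

print(pairs_intact)
```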

Thanks

Best answer

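You are right that shuffling raises the measured accuracy when working with time series data. The reason is that shuffling leaves the training set with samples that are very similar to those in the test set.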

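For example, if you train a model on 2010-2019 and then predict on 2020, all test-set samples are separated in time from the training period, so no information leaks. Now suppose an extreme event occurred in 2020 and you shuffled the data. The training set will now contain samples of that extreme event from some sensors, and on the test set the model will then predict the similar labels of other sensors during the same period. That is information leakage between the training and test sets.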
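One way to keep the test period strictly after the training period is a time-ordered splitter. The sketch below is not part of the answer: it swaps in scikit-learn's TimeSeriesSplit and assumes hypothetical data whose row order is the time order. Note that TimeSeriesSplit is not stratified, and the question's layout (one row per location) would first need to be arranged so that rows follow time.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# Hypothetical data: 100 samples already sorted by time (row index = time order).
X = np.arange(100).reshape(-1, 1)

# Shuffled K-fold: at least one fold trains on samples later than its test samples.
shuffled = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_mixes_time = any(
    train_idx.max() > test_idx.min() for train_idx, test_idx in shuffled.split(X)
)

# TimeSeriesSplit: in every fold, all training indices precede all test indices.
ordered = TimeSeriesSplit(n_splits=5)
ordered_respects_time = all(
    train_idx.max() < test_idx.min() for train_idx, test_idx in ordered.split(X)
)

print(shuffled_mixes_time, ordered_respects_time)
```

If scores drop sharply when moving from the shuffled split to the time-ordered one, that is consistent with the leakage described above rather than with a genuinely better model.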

Regarding python - When should I shuffle in StratifiedKFold, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59619291/
