python - When should I shuffle in StratifiedKFold?

Reposted by: 行者123 Updated: 2023-11-30 08:47:52

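I have read several posts about various cross-validation (CV) methods. What I don't understand is why shuffling the data inside the function leads to a significant increase in accuracy, and when it is correct to do so.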

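My time-series dataset has size 921 × 10080. Each row is a time series of the water temperature at a particular location in one region, and the last 2 columns are labels with 2 groups: high risk (high bacteria level in the water) and low risk (low bacteria level in the water). The accuracy differs greatly depending on the shuffle setting in StratifiedKFold: with shuffle=True I achieved an accuracy of around 75%, compared with an accuracy of 50% with shuffle=False. The setup is shown below: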

from sklearn.model_selection import StratifiedKFold

n_folds = 5
skf = StratifiedKFold(n_splits=n_folds, shuffle=True)
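One way such a comparison could be run is sketched below. It is only an illustration: the random placeholder data, the LogisticRegression model, and the variable names are assumptions and are not taken from the question, so the accuracies it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder data (assumption): 100 samples, 20 features, labels sorted by class.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = np.repeat([0, 1], 50)

model = LogisticRegression(max_iter=1000)  # hypothetical choice of classifier
scores = {}
for shuffle in (False, True):
    # random_state may only be set when shuffle=True.
    cv = StratifiedKFold(n_splits=5, shuffle=shuffle,
                         random_state=0 if shuffle else None)
    scores[shuffle] = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"shuffle={shuffle}: mean accuracy {scores[shuffle].mean():.2f}")
```

Fixing random_state in the shuffled case keeps the comparison repeatable from run to run.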

The sklearn documentation states the following:

A note on shuffling

If the data ordering is not arbitrary (e.g. samples with the same class label are contiguous), shuffling it first may be essential to get a meaningful cross-validation result. However, the opposite may be true if the samples are not independently and identically distributed. For example, if samples correspond to news articles, and are ordered by their time of publication, then shuffling the data will likely lead to a model that is overfit and an inflated validation score: it will be tested on samples that are artificially similar (close in time) to training samples.

Some cross validation iterators, such as KFold, have an inbuilt option to shuffle the data indices before splitting them. Note that:

• This consumes less memory than shuffling the data directly.

• By default no shuffling occurs, including for the (stratified) K fold cross-validation performed by specifying cv=some_integer to cross_val_score, grid search, etc. Keep in mind that train_test_split still returns a random split.

• The random_state parameter defaults to None, meaning that the shuffling will be different every time KFold(..., shuffle=True) is iterated. However, GridSearchCV will use the same shuffling for each set of parameters validated by a single call to its fit method.

• To get identical results for each split, set random_state to an integer.

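I'm not sure whether I am interpreting the documentation correctly, so an explanation would be much appreciated. I also have a few questions:

1) Why is there such a large improvement in accuracy after shuffling? Am I overfitting? When should I shuffle?

2) Given that all samples were collected from the same region, they are probably not independent. How does this affect shuffling? Is shuffling still valid?

3) Does shuffling separate the labels from their corresponding X data? (Answer update: No. Shuffling does not separate the labels from their corresponding X data.)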
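The point in question 3 can be verified with a minimal sketch: StratifiedKFold only yields row indices, and the same indices are applied to both X and y. The toy data below is an assumption, built so that the first feature of each row equals its label.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (assumption): column 0 of X is a copy of the label, so any
# mismatch between a row and its label would be visible after splitting.
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X = np.column_stack([y, np.arange(8)])

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
pairs_intact = True
for train_idx, test_idx in skf.split(X, y):
    # The same index arrays select rows from X and entries from y.
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    pairs_intact &= bool(np.array_equal(X_train[:, 0], y_train))
    pairs_intact &= bool(np.array_equal(X_test[:, 0], y_test))

print(pairs_intact)  # True: every row keeps its own label
```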

Thanks

Best answer

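You are correct that shuffling increases the measured accuracy when working with time-series data. The reason is that shuffling puts samples into the training set that are very similar to samples in the test set.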

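For example, if you train a model on 2010-2019 and then predict on 2020, all test-set samples are separated in time from the training period, so no information leaks. Now suppose an extreme event occurred in 2020 and you shuffled the data. The training set would then contain samples of the extreme event from some sensors, and the model would learn from them to predict similar labels for the other sensors during that period, which sit in the test set. This is information leakage between the training set and the test set.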
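The 2010-2019 versus 2020 example can be sketched as follows. The data is hypothetical (10 samples per year with alternating labels), and the years array is an assumption introduced only to track which period each sample comes from.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical data: 10 samples per year for 2010-2020, alternating labels.
years = np.repeat(np.arange(2010, 2021), 10)
y = np.tile([0, 1], len(years) // 2)
X = years.reshape(-1, 1).astype(float)

# Chronological holdout: train on 2010-2019, test on 2020.
train_idx = np.flatnonzero(years <= 2019)
test_idx = np.flatnonzero(years == 2020)
print("holdout: latest training year", years[train_idx].max(),
      "| earliest test year", years[test_idx].min())

# Shuffled stratified folds: count folds where 2020 appears on both sides.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
mixed_folds = 0
for tr, te in skf.split(X, y):
    if (years[tr] == 2020).any() and (years[te] == 2020).any():
        mixed_folds += 1
print("shuffled folds with 2020 in both train and test:", mixed_folds, "of 5")
```

In the chronological holdout every training sample precedes every test sample, whereas the shuffled folds mix 2020 samples into both sides. For repeated time-ordered folds, scikit-learn also provides TimeSeriesSplit, whose training indices always precede the test indices.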

Regarding "python - When should I shuffle in StratifiedKFold?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59619291/
