python - Test_train_split with stratification

Reposted by: 行者123. Updated: 2023-12-04 17:38:08

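I am trying to split a DataFrame (~188k rows) into a training sample and a test sample. The column 'FLAG' is my target variable and contains the values 0 or 1.

Since only about 1,300 rows have FLAG equal to 1, I want a stratified split so that both samples contain a representative number of 1s.

I tried to do the split with sklearn's train_test_split function: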

train, test = train_test_split(df, test_size=0.2, stratify=df["FLAG"])

My problem is that the resulting training and test samples have 177942 and 52 rows respectively. I would have expected 150400 and 37600 rows.
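For reference, the expected sizes follow from simple arithmetic on the row count. The sketch below assumes a round figure of 188,000 rows and assumes that a fractional test_size is rounded up to a whole number of test rows, with the remainder going to the training set. The helper name `expected_split_sizes` is made up for this illustration and is not part of sklearn.

```python
import math


def expected_split_sizes(n_samples, test_size=0.2):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test set gets ceil(test_size * n_samples) rows
    and the training set gets whatever is left.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# 188_000 is an assumed round figure for the "~188k rows" in the question.
print(expected_split_sizes(188_000))
```

Whatever the exact rounding rule, the two sizes should add up to the number of input rows, which the observed 177942 and 52 do not.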

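My understanding from the documentation (sklearn.model_selection.train_test_split) is that I have to supply my DataFrame, the test_size, and the column containing the target classes (i.e. 'FLAG' in my case).

Even a generic example: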

df = pd.DataFrame(data={'a': np.random.rand(100000), 'b': np.random.rand(100000), 'c': 0})
df.loc[np.random.randint(0, 100000, 1000), 'c'] = 1
tr, ts = train_test_split(df, test_size=.2, stratify=df['c'])
print(tr.shape, ts.shape)

returns: (93105, 3) (38, 3)
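A quick way to spot a broken split like this one is to compare the row counts and the share of 1s in each part. The following is a minimal sketch using only the standard library. The function name `summarize_split` and the small label lists are invented for illustration; with the example above one would presumably pass `tr['c']` and `ts['c']` instead.

```python
from collections import Counter


def summarize_split(train_labels, test_labels):
    """Return row counts and positive-class rates for a 0/1 labelled split."""
    summary = {}
    for name, labels in (("train", train_labels), ("test", test_labels)):
        counts = Counter(labels)
        n_rows = len(labels)
        summary[name] = {
            "rows": n_rows,
            "positives": counts[1],
            # Guard against an empty part to avoid dividing by zero.
            "positive_rate": counts[1] / n_rows if n_rows else 0.0,
        }
    return summary


# Hypothetical labels: 1,000 rows with 1% positives, split 80/20.
train_labels = [0] * 792 + [1] * 8
test_labels = [0] * 198 + [1] * 2

for part, stats in summarize_split(train_labels, test_labels).items():
    print(part, stats)
```

In a healthy stratified split the two positive rates are nearly equal and the row counts sum to the size of the input.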

My imports:

import cx_Oracle
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

My Python version: 3.7.0. sklearn version: 0.20.3. pandas version: 0.23.4.

Best answer

My investigation showed that the problem is caused by an integer overflow. It only occurs on 32-bit Python 3.7.x. The 64-bit build works fine.
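To illustrate why row counts of this size can overflow, here is a small sketch of signed 32-bit wraparound. The answer does not say which computation inside sklearn overflowed, so multiplying the two expected row counts is purely an assumed example, and `wrap_int32` is a hypothetical helper written for this illustration.

```python
def wrap_int32(value):
    """Map an exact Python int onto the value a signed 32-bit integer would hold."""
    return (value + 2**31) % 2**32 - 2**31


# Assumed example: the two expected row counts from the question.
exact_product = 150400 * 37600
wrapped_product = wrap_int32(exact_product)

print("exact product:     ", exact_product)
print("32-bit wraparound: ", wrapped_product)
print("fits in int32:     ", exact_product == wrapped_product)
```

Python's own integers never overflow, so the wraparound has to be simulated here. Fixed-width integers in compiled code are where it can happen silently.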

In the end I switched to 64-bit Python to solve the problem (I previously had to use the 32-bit version because of an unrelated Oracle package dependency).
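To find out which build is in use, the interpreter's pointer size can be checked with the standard library. This is a minimal sketch and `interpreter_bits` is just an illustrative name.

```python
import struct
import sys


def interpreter_bits():
    """Return the pointer width of the running interpreter in bits (32 or 64)."""
    return struct.calcsize("P") * 8


print("Python build:", interpreter_bits(), "bit")
# sys.maxsize gives the same answer from a different angle.
print("sys.maxsize > 2**32:", sys.maxsize > 2**32)
```

A 32-bit interpreter reports 32 here even when it runs on a 64-bit operating system, which is the setup described above.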

Regarding python - Test_train_split with stratification, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55742246/
