python 1:1 stratified sampling per each group

Reposted by: 行者123. Updated: 2023-12-01 08:20:58

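How can I do 1:1 stratified sampling in Python?

Suppose a pandas DataFrame df is heavily imbalanced. It contains a binary group column and several categorical sub-group columns.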

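import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'group': [0, 1, 0, 1, 0], 'sub_category_1': [1, 2, 2, 1, 1],
                   'sub_category_2': [1, 2, 2, 1, 1], 'value': [1, 2, 3, 1, 2]})
# display() is available in Jupyter/IPython; use print() in a plain script
display(df)
display(df[df.group == 1])
display(df[df.group == 0])
df.group.value_counts()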

For every member of the main group (group == 1), I need to find a single match from group == 0.

StratifiedShuffleSplit from scikit-learn only returns a random portion of the data, not a 1:1 match.
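
To make the goal concrete, here is a small sketch (not part of the original question) that counts rows per sub-category combination in each group. The names `counts`, `needed` and `available` are illustrative only. A 1:1 match has to draw `needed` rows out of the `available` group 0 rows in every combination.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'group': [0, 1, 0, 1, 0],
                   'sub_category_1': [1, 2, 2, 1, 1],
                   'sub_category_2': [1, 2, 2, 1, 1], 'value': [1, 2, 3, 1, 2]})

# rows: sub-category combinations; columns: row counts for group 0 and group 1
counts = pd.crosstab([df.sub_category_1, df.sub_category_2], df.group)
print(counts)

needed = counts[1]     # rows to match in each combination (group 1)
available = counts[0]  # candidate rows in each combination (group 0)
```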

Best answer

If I understand correctly, you could use np.random.permutation:

import numpy as np
import pandas as pd

np.random.seed(42)

df = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'group': [0, 1, 0, 1, 0], 'sub_category_1': [1, 2, 2, 1, 1],
                   'sub_category_2': [1, 2, 2, 1, 1], 'value': [1, 2, 3, 1, 2]})

# create new column with an integer identifier for each combination of categories
# (the '_' separator keeps combinations such as (1, 12) and (11, 2) distinct)
columns = ['sub_category_1', 'sub_category_2']
labels = df.loc[:, columns].apply(lambda x: '_'.join(map(str, x.values)), axis=1)
df['label'], _ = pd.factorize(labels)

# build distribution of sub-category combinations in group 1
distribution = df[df.group == 1].label.value_counts().to_dict()

# select from group 0 only those rows that share a sub-category combination with group 1
mask = (df.group == 0) & (df.label.isin(list(distribution)))

# do random sampling: per combination, take as many group 0 rows as group 1 has
# (np.concatenate also works when the per-combination arrays differ in length)
selected = np.concatenate([np.random.permutation(group.index)[:distribution[name]]
                           for name, group in df.loc[mask].groupby('label')])

# display result (selected holds index labels, so use .loc rather than .iloc)
result = df.drop('label', axis=1).loc[selected]
print(result)

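Output (the alphabetical column order comes from an older pandas version; newer versions keep the insertion order, so id is printed first)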

   group  id  sub_category_1  sub_category_2  value
4      0   5               1               1      2
2      0   3               2               2      3
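
One way to sanity-check a sample like this is to compare the row count of every sub-category combination in the sample with the count in group 1. The sketch below is an addition, not part of the original answer. The index labels in `selected` are copied from the printed output, and the names `sample`, `target` and `drawn` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'group': [0, 1, 0, 1, 0],
                   'sub_category_1': [1, 2, 2, 1, 1],
                   'sub_category_2': [1, 2, 2, 1, 1], 'value': [1, 2, 3, 1, 2]})
columns = ['sub_category_1', 'sub_category_2']

# index labels of the sampled group 0 rows, copied from the printed output
selected = [4, 2]
sample = df.loc[selected]

# rows per sub-category combination in group 1 and in the sample
target = df[df.group == 1].groupby(columns).size()
drawn = sample.groupby(columns).size()

only_controls = bool((sample.group == 0).all())  # every sampled row is from group 0
same_counts = drawn.equals(target)               # 1:1 in every combination
print(only_controls, same_counts)
```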

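Note that this solution assumes that, for every sub_category combination, group 1 is no larger than the corresponding subgroup of group 0. If it is larger, the slice silently returns fewer rows than needed. A more robust version uses np.random.choice with replacement: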

selected = np.concatenate([np.random.choice(group.index, distribution[name], replace=True)
                           for name, group in df.loc[mask].groupby('label')])

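The version with choice does not rely on the same assumption as the version with permutation, although it does require at least one group 0 row for every sub-category combination. Because it samples with replacement, the same group 0 row may be selected more than once.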
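
The two variants can be folded into a single helper. The following is only a sketch under stated assumptions, not the original answer's code. The function name `sample_matches` and its parameters are hypothetical, the DataFrame is assumed to have a unique index and a binary `group` column, and raising an error when group 0 has too few rows is a design choice made here, not something the answer prescribes. It uses `groupby(...).ngroup()` for the combination label and a seeded `np.random.default_rng` generator instead of the global seed.

```python
import numpy as np
import pandas as pd


def sample_matches(df, columns, replace=False, seed=None):
    """Hypothetical helper: per sub-category combination, draw as many
    group 0 rows as there are group 1 rows."""
    rng = np.random.default_rng(seed)
    label = df.groupby(columns).ngroup()           # integer id per combination
    needed = label[df.group == 1].value_counts()   # rows to match per combination
    picked = []
    for name, n in needed.items():
        pool = df.index[(df.group == 0) & (label == name)].to_numpy()
        # assumption: fail loudly instead of silently returning fewer rows
        if len(pool) == 0 or (not replace and len(pool) < n):
            raise ValueError(f"not enough group 0 rows for combination {name}")
        picked.append(rng.choice(pool, size=n, replace=replace))
    if not picked:                                 # no group 1 rows at all
        return df.iloc[0:0]
    return df.loc[np.concatenate(picked)]


df = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'group': [0, 1, 0, 1, 0],
                   'sub_category_1': [1, 2, 2, 1, 1],
                   'sub_category_2': [1, 2, 2, 1, 1], 'value': [1, 2, 3, 1, 2]})
matched = sample_matches(df, ['sub_category_1', 'sub_category_2'], seed=0)
print(matched)
```

With replace=False the helper behaves like the permutation version but raises when a combination cannot be matched. With replace=True it behaves like the choice version and may return the same group 0 row several times.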

This post is based on a similar question found on Stack Overflow about python 1:1 stratified sampling per each group: https://stackoverflow.com/questions/54653317/
