
python - Stratified sampling in numpy

Reposted. Author: 太空狗. Updated: 2023-10-30 00:15:01

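In numpy I have a dataset like the one below. The first two columns are indices. I can divide the dataset into blocks by these indices: the first block is 0 0, the second is 0 1, the third is 0 2, then 1 0, 1 1, 1 2, and so on. Each block has at least two elements. The numbers in the index columns can vary.

I need to split the dataset randomly 80%-20% along these blocks, such that after the split each block has at least 1 element in both datasets. How can I do that?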

indices | real data
        |
0   0   | 43.25 665.32 ...  } 1st block
0   0   | 11.234            }
0   1     ...               } 2nd block
0   1                       }
0   2                       } 3rd block
0   2                       }
1   0                       } 4th block
1   0                       }
1   0                       }
1   1                       ...
1   1
1   2
1   2
2   0
2   0
2   1
2   1
2   1
...

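Best answer

See how this works out for you. To introduce the randomness, I shuffle the whole dataset. It is the only way I have found to do the split in a vectorized fashion. Maybe you could simply shuffle an index array instead, but that was one indirection too many for my brain today. I have also used a structured array, to make extracting the blocks easier. First, let's create a sample dataset: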

from __future__ import division
import numpy as np

# Create a sample data set
c1, c2 = 10, 5
idx1, idx2 = np.arange(c1), np.arange(c2)
idx1, idx2 = np.repeat(idx1, c2), np.tile(idx2, c1)

items = 1000
i = np.random.randint(c1*c2, size=(items - 2*c1*c2,))
d = np.random.rand(items+5)

# np.int and np.float were removed in NumPy 1.24; the builtins are equivalent
dataset = np.empty((items+5,), [('idx1', int), ('idx2', int),
                                ('data', float)])
dataset['idx1'][:2*c1*c2] = np.tile(idx1, 2)
dataset['idx1'][2*c1*c2:-5] = idx1[i]
dataset['idx2'][:2*c1*c2] = np.tile(idx2, 2)
dataset['idx2'][2*c1*c2:-5] = idx2[i]
dataset['data'] = d
# Add blocks with only 2 and only 3 elements to test corner case
dataset['idx1'][-5:] = -1
dataset['idx2'][-5:] = [0] * 2 + [1]*3

Now the stratified sampling:

# For randomness, shuffle the entire array
np.random.shuffle(dataset)

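# inverse[k] is the block number of row k in the shuffled dataset
blocks, inverse = np.unique(dataset[['idx1', 'idx2']], return_inverse=True)
block_count = np.bincount(inverse)
where = np.argsort(inverse)  # row order that groups each block together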
block_start = np.concatenate(([0], np.cumsum(block_count)[:-1]))

# If we have n elements in a block, and we assign 1 to each array, we
# are left with only n-2. If we randomly assign a fraction x of these
# to the first array, the expected ratio of items will be
# (x*(n-2) + 1) : ((1-x)*(n-2) + 1)
# Setting the ratio equal to 4 (80/20) and solving for x, we get
# x = 4/5 + 3/5/(n-2)

with np.errstate(divide='ignore'):  # n == 2 divides by zero; clipped below
    x = 4/5 + 3/5/(block_count - 2)
x = np.clip(x, 0, 1) # x exceeds 1 for n <= 4 (inf for n == 2), so clip it
threshold = np.repeat(x, block_count)
threshold[block_start] = 1 # first item goes to A
threshold[block_start + 1] = 0 # second item goes to B

a_idx = threshold > np.random.rand(len(dataset))

A = dataset[where[a_idx]]
B = dataset[where[~a_idx]]
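
The index-array variant mentioned earlier might look roughly like the sketch below. It is not part of the original answer. It assumes a plain 2-D float array whose first two columns are the indices (as in the question) and at least two rows per block. The function name stratified_split and its frac and seed parameters are made up for illustration. The probability x = (frac*n - 1)/(n - 2) is the same expression as in the comments above, written for a general fraction; for frac = 0.8 it reduces to 4/5 + 3/5/(n-2).

```python
import numpy as np

def stratified_split(data, frac=0.8, seed=None):
    """Split rows so that every (col 0, col 1) block appears in both parts.

    Hypothetical helper: `data` is a 2-D array whose first two columns are
    the block indices, and every block must contain at least two rows.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(data))      # shuffle row indices, not the data
    keys = data[perm, :2]

    # Label every shuffled row with the number of its block.
    _, inverse, counts = np.unique(keys, axis=0, return_inverse=True,
                                   return_counts=True)
    inverse = inverse.ravel()              # guard against a 2-D inverse
    if counts.min() < 2:
        raise ValueError("every block needs at least two rows")

    order = np.argsort(inverse, kind="stable")   # groups rows block by block
    starts = np.concatenate(([0], np.cumsum(counts)[:-1]))

    # Chance that a row beyond the two reserved ones goes to the first part.
    with np.errstate(divide="ignore", invalid="ignore"):
        x = (frac * counts - 1) / (counts - 2)
    threshold = np.repeat(np.clip(x, 0, 1), counts)
    threshold[starts] = 1                  # first row of each block -> part A
    threshold[starts + 1] = 0              # second row of each block -> part B

    to_a = threshold > rng.random(len(data))
    rows = perm[order]                     # back to row numbers of `data`
    return data[rows[to_a]], data[rows[~to_a]]

# Small demo: 9 blocks, each with at least two rows.
demo_rng = np.random.default_rng(0)
block_keys = np.array([(i, j) for i in range(3) for j in range(3)])
pick = np.concatenate([np.repeat(np.arange(9), 2),
                       demo_rng.integers(9, size=200)])
data = np.column_stack([block_keys[pick], demo_rng.random(len(pick))])
A, B = stratified_split(data, seed=1)
```

Because only the permutation is shuffled, data itself is left untouched, and rows[to_a] gives the original row numbers that ended up in the first part.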

After running this, the split is roughly 80/20, and all blocks are represented in both arrays:

>>> len(A)
815
>>> len(B)
190
>>> np.all(np.unique(A[['idx1', 'idx2']]) == np.unique(B[['idx1', 'idx2']]))
True
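
The split is only roughly 80/20 for two reasons. The formula fixes the expected ratio per block, not the realized one. And blocks with 2, 3 or 4 elements cannot reach 4:1 at all: with one item reserved for each side and x clipped to 1 they end up 1:1, 2:1 and 3:1. The quick check below (not part of the original answer, with variable names chosen for illustration) confirms numerically that for block sizes of 5 and above the expression x = 4/5 + 3/5/(n-2) gives an expected ratio of exactly 4:

```python
import numpy as np

n = np.arange(5, 1001)             # block sizes for which x needs no clipping
x = 4/5 + 3/5/(n - 2)

expected_a = x*(n - 2) + 1         # one reserved item plus a share of the rest
expected_b = (1 - x)*(n - 2) + 1

print(np.allclose(expected_a / expected_b, 4))   # True
```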

For python - Stratified sampling in numpy, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/15838733/
