python - Justification of the constants used in random.sample

Reposted by: 行者123. Updated: 2023-11-28 17:08:24


I am studying the source code of the function sample in random.py (Python standard library).

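The idea is simple:

If a small sample (k) is needed from a large population (n): just pick k random indices, since the population is so large that you are unlikely to pick the same index twice. If you do, just pick again.
If a relatively large sample (k) is needed compared to the total population (n): it is better to keep track of what is still available to pick (the pool).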
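
To put a number on "unlikely", here is a minimal sketch (not part of random.py) of how many draws the pick-again strategy needs on average. It assumes every draw is uniform over the n indices, so once i distinct indices have been chosen, the next new one takes n / (n - i) draws on average. The name expected_draws is made up for illustration.

```python
def expected_draws(n, k):
    """Expected number of uniform draws needed to collect k distinct
    indices out of n when duplicates are simply redrawn."""
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    # After i distinct indices are chosen, a draw is new with
    # probability (n - i) / n, so it takes n / (n - i) draws on average.
    return sum(n / (n - i) for i in range(k))


if __name__ == "__main__":
    print(expected_draws(1000, 10))   # barely more than 10 draws
    print(expected_draws(1000, 900))  # far more than 900 draws
```

For small k the overhead of re-picking is negligible; as k approaches n it grows quickly, which is why a different strategy is used there.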

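My question

Several constants are involved: setsize = 21 and setsize += 4 ** _ceil(_log(k * 3, 4)). The critical ratio is roughly k : 21+3k. The comments say # size of a small set minus size of an empty list and # table size for big sets.

Where do these specific numbers come from? What is the justification?

The comments shed some light, but I find they raise as many questions as they answer.

I sort of understand "size of a small set", but I find "minus size of an empty list" confusing. Can someone shed light on this?
What exactly does "table" size mean, as opposed to "set size"?

Looking at the GitHub repository, a very old version appears to have simply used k : 6*k as the critical ratio, but I find that equally mysterious.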

The code

def sample(self, population, k):
    """Chooses k unique random elements from a population sequence or set.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    To choose a sample in a range of integers, use range as an argument.
    This is especially fast and space efficient for sampling from a
    large population:   sample(range(10000000), 60)
    """

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    if isinstance(population, _Set):
        population = tuple(population)
    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence or set.  For dicts, use list(d).")
    randbelow = self._randbelow
    n = len(population)
    if not 0 <= k <= n:
        raise ValueError("Sample larger than population or is negative")
    result = [None] * k
    setsize = 21        # size of a small set minus size of an empty list
    if k > 5:
        setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
    if n <= setsize:
        # An n-length list is smaller than a k-length set
        pool = list(population)
        for i in range(k):         # invariant:  non-selected at [0,n-i)
            j = randbelow(n-i)
            result[i] = pool[j]
            pool[j] = pool[n-i-1]   # move non-selected item into vacancy
    else:
        selected = set()
        selected_add = selected.add
        for i in range(k):
            j = randbelow(n)
            while j in selected:
                j = randbelow(n)
            selected_add(j)
            result[i] = population[j]
    return result
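
For orientation, here is a short usage sketch of the public random.sample function, based on the example given in the docstring above. The variable names are illustrative only.

```python
import random

# Sampling from a range avoids building a 10-million-element list.
winners = random.sample(range(10_000_000), 60)

# The result is in selection order, so slices are themselves random samples.
grand_prize = winners[:1]
second_place = winners[1:6]

print(len(winners), len(set(winners)))  # 60 distinct values
```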

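(My apologies if this question would be better placed on math.stackexchange. I can't think of any probabilistic or statistical reason for this particular ratio, and the comments sound as though it might have to do with the amount of space that sets and lists use, but I couldn't find any details anywhere.)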

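Best answer

This code is trying to determine whether using a list or a set would take more space (rather than trying to estimate the time cost, for some reason).

It looks like 21 is the difference between the size of an empty list and the size of a small set on the Python version this constant was determined on, expressed in multiples of the pointer size. I don't have a build of that version of Python, but testing on my 64-bit CPython 3.6.3 gives a difference of 20 pointer sizes (160 bytes at 8 bytes per pointer):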

>>> import sys
>>> sys.getsizeof(set()) - sys.getsizeof([])
160

and comparing the 3.6.3 list and set struct definitions with the list and set definitions from the change that introduced this code, 21 seems plausible.
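
The same measurement can be expressed directly in pointer units. This is a minimal sketch, assuming CPython (sys.getsizeof is implementation-specific); the result depends on the interpreter version and platform, so no particular value is claimed here.

```python
import struct
import sys

pointer_size = struct.calcsize("P")  # bytes per pointer on this build
diff_bytes = sys.getsizeof(set()) - sys.getsizeof([])
diff_pointers = diff_bytes // pointer_size

print(pointer_size, diff_bytes, diff_pointers)
```

On the 64-bit CPython 3.6.3 build mentioned above, this corresponds to 160 bytes, that is, 20 pointers.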

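I say "the difference between the size of an empty list and a small set" because, both now and at the time, small sets use a hash table that is contained inside the set struct itself rather than allocated externally: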

setentry smalltable[PySet_MINSIZE];

if k > 5:
    setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets

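This check adds the size of the external table allocated for sets with more than 5 elements, with the size again expressed as a number of pointers. The computation assumes the set never shrinks, since the sampling algorithm never removes elements. I am not currently sure whether this computation is accurate.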
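
To see what this formula produces, here is a small sketch that reproduces the setsize computation from the quoted source as a standalone function. The function name estimated_set_cost is made up for illustration, and math.ceil and math.log stand in for the _ceil and _log aliases used in random.py.

```python
from math import ceil, log


def estimated_set_cost(k):
    """Estimated size of a k-element set, in pointer-sized units,
    following the setsize computation quoted above."""
    setsize = 21  # small set minus empty list
    if k > 5:
        # Smallest power of 4 that is at least 3 * k.
        setsize += 4 ** ceil(log(k * 3, 4))
    return setsize


if __name__ == "__main__":
    for k in (5, 6, 21, 22, 100):
        print(k, estimated_set_cost(k))
```

The estimate jumps in steps (21, 85, 277, 1045, ...) because the table term is always a power of 4, so the threshold is between 3k and 12k plus 21 rather than a fixed multiple of k.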

Finally,

if n <= setsize:

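compares the base overhead of a set, plus any space used by an external hash table, against the n pointers required for a list of the input elements. (It does not seem to account for the over-allocation performed by list(population), so it may underestimate the cost of the list.)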
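
Putting the pieces together, the branch decision can be sketched as a standalone predicate. This mirrors only the comparison in the quoted source; the name uses_pool is hypothetical and does not exist in random.py.

```python
from math import ceil, log


def uses_pool(n, k):
    """Return True if the quoted sample() code would copy the population
    into a list (pool), False if it would track selections in a set."""
    setsize = 21  # small set minus empty list, in pointers
    if k > 5:
        setsize += 4 ** ceil(log(k * 3, 4))  # external table for big sets
    return n <= setsize


if __name__ == "__main__":
    print(uses_pool(21, 3))           # True: population fits in 21 pointers
    print(uses_pool(22, 3))           # False
    print(uses_pool(85, 6))           # True: 21 + 64 = 85
    print(uses_pool(10_000_000, 60))  # False: the set is far cheaper
```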

Regarding python - Justification of the constants used in random.sample, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49496794/
