
Python random sample with a generator / iterable / iterator

Reposted. Author: IT老高. Updated: 2023-10-28 21:51:50

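Do you know if there is a way to get Python's random.sample to work with a generator object? I am trying to get a random sample from a very large text corpus. The problem is that random.sample() raises the following error.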

TypeError: object of type 'generator' has no len()

I was thinking that there might be some way of doing this with something from itertools, but a bit of searching turned up nothing.

A made-up example:

import random

def list_item(ls):
    for item in ls:
        yield item

random.sample(list_item(range(100)), 20)
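For context, here is a minimal sketch of the failure together with the simplest workaround. It assumes the whole population fits in memory, which is exactly what a very large corpus may not allow. The wording of the TypeError differs between Python versions, so the sketch only checks the exception type.

```python
import random


def list_item(ls):
    """Yield items one by one, hiding the length of the underlying sequence."""
    for item in ls:
        yield item


# random.sample() needs a sequence, so a generator is rejected.
try:
    random.sample(list_item(range(100)), 20)
    failed = False
except TypeError:
    failed = True

# Workaround when memory allows: materialize the generator into a list first.
sample = random.sample(list(list_item(range(100))), 20)
print(failed, len(sample))
```

When the population is too large to hold in a list, a single-pass method such as the ones timed in the update is needed instead.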


Update


At Martijn Pieters's request, I timed the three methods proposed so far. The results are as follows.

Sampling 1000 from 10000
Using iterSample 0.0163 s
Using sample_from_iterable 0.0098 s
Using iter_sample_fast 0.0148 s

Sampling 10000 from 100000
Using iterSample 0.1786 s
Using sample_from_iterable 0.1320 s
Using iter_sample_fast 0.1576 s

Sampling 100000 from 1000000
Using iterSample 3.2740 s
Using sample_from_iterable 1.9860 s
Using iter_sample_fast 1.4586 s

Sampling 200000 from 1000000
Using iterSample 7.6115 s
Using sample_from_iterable 3.0663 s
Using iter_sample_fast 1.4101 s

Sampling 500000 from 1000000
Using iterSample 39.2595 s
Using sample_from_iterable 4.9994 s
Using iter_sample_fast 1.2178 s

Sampling 2000000 from 5000000
Using iterSample 798.8016 s
Using sample_from_iterable 28.6618 s
Using iter_sample_fast 6.6482 s

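So list.insert (the results.insert call in iterSample) has a serious drawback when it comes to large sample sizes. The code I used to time the methods: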

from heapq import nlargest
import random
import timeit


def iterSample(iterable, samplesize):
    results = []
    for i, v in enumerate(iterable):
        r = random.randint(0, i)
        if r < samplesize:
            if i < samplesize:
                results.insert(r, v)  # add first samplesize items in random order
            else:
                results[r] = v  # at a decreasing rate, replace random items

    if len(results) < samplesize:
        raise ValueError("Sample larger than population.")

    return results


def sample_from_iterable(iterable, samplesize):
    # Tag every item with a random key and keep the items with the largest keys
    return (x for _, x in nlargest(samplesize, ((random.random(), x) for x in iterable)))


def iter_sample_fast(iterable, samplesize):
    results = []
    iterator = iter(iterable)
    # Fill in the first samplesize elements:
    for _ in range(samplesize):
        results.append(next(iterator))
    random.shuffle(results)  # Randomize their positions
    for i, v in enumerate(iterator, samplesize):
        r = random.randint(0, i)
        if r < samplesize:
            results[r] = v  # at a decreasing rate, replace random items

    if len(results) < samplesize:
        raise ValueError("Sample larger than population.")
    return results


if __name__ == '__main__':
    pop_sizes = [int(10e+3), int(10e+4), int(10e+5), int(10e+5), int(10e+5), int(10e+5) * 5]
    k_sizes = [int(10e+2), int(10e+3), int(10e+4), int(10e+4) * 2, int(10e+4) * 5, int(10e+5) * 2]

    for pop_size, k_size in zip(pop_sizes, k_sizes):
        pop = range(pop_size)
        k = k_size
        t1 = timeit.Timer(stmt='iterSample(pop, %i)' % k_size, setup='from __main__ import iterSample,pop')
        t2 = timeit.Timer(stmt='sample_from_iterable(pop, %i)' % k_size, setup='from __main__ import sample_from_iterable,pop')
        t3 = timeit.Timer(stmt='iter_sample_fast(pop, %i)' % k_size, setup='from __main__ import iter_sample_fast,pop')

        print('Sampling', k, 'from', pop_size)
        print('Using iterSample', '%1.4f s' % (t1.timeit(number=100) / 100.0))
        print('Using sample_from_iterable', '%1.4f s' % (t2.timeit(number=100) / 100.0))
        print('Using iter_sample_fast', '%1.4f s' % (t3.timeit(number=100) / 100.0))
        print('')

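I also ran a test to check that all the methods really do take an unbiased sample of the generator. For each method, I sampled 1000 elements from 10000, repeated 100000 times, and computed the average frequency of occurrence of each item in the population. It comes out at ~0.1, as one would expect, for all three methods.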
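A check of this kind could be sketched as follows. The names `reservoir_sample` and `per_item_frequency` are hypothetical, and the sampler is a simplified stand-in rather than one of the three timed functions. Note that the mean over all items equals k/n by construction whenever every sample contains k distinct items, so the spread of the per-item frequencies (their minimum and maximum) is the more informative figure.

```python
import random
from collections import Counter


def reservoir_sample(iterable, k):
    """Hypothetical stand-in sampler (reservoir sampling), written for Python 3.

    The order of the result is not randomized; only membership matters here.
    """
    results = []
    for i, v in enumerate(iterable):
        if i < k:
            results.append(v)          # keep the first k items
        else:
            r = random.randint(0, i)   # inclusive on both ends
            if r < k:
                results[r] = v         # replace with probability k / (i + 1)
    if len(results) < k:
        raise ValueError("Sample larger than population.")
    return results


def per_item_frequency(sampler, n, k, trials):
    """Return, for each item of range(n), the fraction of trials that selected it."""
    counts = Counter()
    for _ in range(trials):
        # Feed a generator so the sampler cannot rely on len().
        counts.update(sampler((x for x in range(n)), k))
    return [counts[x] / trials for x in range(n)]


freqs = per_item_frequency(reservoir_sample, n=20, k=5, trials=2000)
print("expected %.3f, min %.3f, max %.3f" % (5 / 20, min(freqs), max(freqs)))
```

For an unbiased sampler, every per-item frequency should approach k/n as the number of trials grows.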

Best answer

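While Martijn Pieters's answer is correct, it does slow down as samplesize grows, because calling list.insert in a loop can have quadratic complexity: each insert may have to shift every element after the insertion point.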
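The cost difference can be illustrated in isolation with a rough sketch. The function names below are hypothetical and the measured times will vary by machine; the point is only that repeated inserts at random positions grow much faster than appending and shuffling once.

```python
import random
import timeit


def fill_with_insert(n):
    """Build a list by inserting each item at a random position."""
    results = []
    for i in range(n):
        # Each insert shifts the elements after position r: O(len(results)).
        results.insert(random.randint(0, i), i)
    return results


def fill_with_append_shuffle(n):
    """Build the list with O(1) appends, then randomize the order once."""
    results = []
    for i in range(n):
        results.append(i)
    random.shuffle(results)
    return results


if __name__ == "__main__":
    for n in (1000, 10000, 30000):
        t_insert = timeit.timeit(lambda: fill_with_insert(n), number=1)
        t_shuffle = timeit.timeit(lambda: fill_with_append_shuffle(n), number=1)
        print(n, "insert: %.4f s" % t_insert, "append+shuffle: %.4f s" % t_shuffle)
```

Both functions produce a random permutation of the same items; only the amount of element shifting differs.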

Here is an alternative that, in my opinion, preserves the uniformity while improving performance:

import random

def iter_sample_fast(iterable, samplesize):
    results = []
    iterator = iter(iterable)
    # Fill in the first samplesize elements:
    try:
        for _ in range(samplesize):
            results.append(next(iterator))
    except StopIteration:
        raise ValueError("Sample larger than population.")
    random.shuffle(results)  # Randomize their positions
    for i, v in enumerate(iterator, samplesize):
        r = random.randint(0, i)
        if r < samplesize:
            results[r] = v  # at a decreasing rate, replace random items
    return results

The difference slowly starts to show for samplesize values above 10000. Times for calling with (1000000, 100000):

  • iterSample: 5.05 s
  • iter_sample_fast: 2.64 s
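For the use case in the question (sampling from a large text corpus), the same single-pass idea could be applied to lines read lazily from a file. This is a sketch under assumptions: `sample_lines` is a hypothetical name, it uses `itertools.islice` for the initial fill rather than the loop in the answer, and an in-memory `io.StringIO` stands in for a real corpus file.

```python
import io
import random
from itertools import islice


def sample_lines(lines, k):
    """Reservoir-sample k items from any iterable in a single pass (Python 3)."""
    iterator = iter(lines)
    results = list(islice(iterator, k))       # take the first k items
    if len(results) < k:
        raise ValueError("Sample larger than population.")
    random.shuffle(results)                   # randomize their positions
    for i, line in enumerate(iterator, k):
        r = random.randint(0, i)
        if r < k:
            results[r] = line                 # replace at a decreasing rate
    return results


# Stand-in for a large corpus; a file object from open() iterates line by line too.
corpus = io.StringIO("".join("sentence %d\n" % i for i in range(1000)))
picked = sample_lines((line.rstrip("\n") for line in corpus), 5)
print(picked)
```

Only the k sampled lines are held in memory at any time, so the size of the corpus does not matter beyond the time needed to read it once.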

On the topic of Python random sample with a generator / iterable / iterator, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/12581437/
