gpt4 book ai didi

algorithm - Knuth 的水库采样伪代码中可能存在错误

转载 作者:塔克拉玛干 更新时间:2023-11-03 05:31:00 24 4
gpt4 key购买 nike

下面是 Knuth 的水库采样伪代码(如何从一组 k 数字中选择 n 数字,确保每个数字都具有相同的概率)。

初始化:一个大小为:k的水库.

for i = k+1 to N
M = random(1, i);

if (M < k) // should this be if (M <= k)
SWAP the Mth value and ith value
end if
end for

从这段代码中,M < K的概率是(k-1)/i , 不是 k/i ,所以我认为 if循环体中的语句应该是if (M < =k) .我试图测试它们之间的区别,但我一无所获。

最佳答案

你是对的。但是,您的代码没有正确实现算法 R。错误是您的(或编写此代码的任何人),而不是 Knuth 的 ;-)

引自 Knuth(计算机编程艺术第 2 卷 3Ed 1998,第 144 页):

... A problem arises if we don't know the value of N in advance, since the precise value of N is crucial in Algorithm S. Suppose we want to select n items at random from a file, without knowing exactly how many are present in that file. We could first go through and count the records, then take a second pass to select them; but it is generally better to sample m > n of the original items on the first pass, where m is much less than N, so that only m items must be considered on the second pass. The trick, of course, is to do this in such a way that the final result is a truly random sample of the original file.

Since we don't know when the input is going to end, we must keep track of a random sample of the input records seen so far, thus always being prepared for the end. As we read the input we will construct a "reservoir" that contains only the records that have appeared among the previous samples. The first n records always go into the reservoir. When the (t + 1)st record is being input, for t>n, we will have in memory a table of n indices pointing to the records that we have chosen from among the first t. The problem is to maintain this situation with t increased by one, namely to find a new random sample from among the t + 1 records now known to be present. It is not hard to see that we should include the new record in the new sample with probability n/(t + 1), and in such a case it should replace a random element of the previous sample.

Thus, the following procedure does the job:

Algorithm R (Reservoir sampling). To select n records at random from a file of unknown size > n, given n > 0. An auxiliary file called the "reservoir" contains all records that are candidates for the final sample. The algorithm uses a table of distinct indices I[j] for 1 < j < n, each of which points to one of the records in the reservoir.

R1. [Initialize.] Input the first n records and copy them to the reservoir. Set I[j] ← j for 1 < j < n, and set t ← m ← n. (If the file being sampled has fewer than n records, it will of course be necessary to abort the algorithm and report failure. During this algorithm, indices I[1], ..., I[n] point to the records in the current sample; m is the size of the reservoir; and t is the number of input records dealt with so far.)

R2. [End of file?] If there are no more records to be input, go to step R6.

R3. [Generate and test.] Increase t by 1, then generate a random integer M between 1 and t (inclusive). If M > n, go to R5.

R4. [Add to reservoir.] Copy the next record of the input file to the reservoir, increase m by 1, and set I[M] ← m. (The record previously pointed to by I[M] is being replaced in the sample by the new record.) Go back to R2.

R5. [Skip.] Skip over the next record of the input file (do not include it in the reservoir), and return to step R2.

R6. [Second pass.] Sort the I table entries so that I[1] < ... < I[n]; then go through the reservoir, copying the records with these indices into the output file that is to hold the final sample.

算法 R 的伪代码如下所示:

for j= 1 to n
Reservoir[j]= File.GetNext()
I[j]= j

t=n // number of input records so far
m=n // size of the reservoir

while not File.EOF()
x= File.GetNext()
t++
M= Random(1..t)
if (M<=n)
m++
Reservoir[m]= x
I[M]= m

Sort(I[1..n])

for j= 1 to n
Output[j]= Reservoir[I[j]]

关于algorithm - Knuth 的水库采样伪代码中可能存在错误,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/17638962/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com