gpt4 book ai didi

python - 使用堆进行大磁盘排序

转载 作者:塔克拉玛干 更新时间:2023-11-03 03:35:02 24 4
gpt4 key购买 nike

关于官方 Python 文档 here ,有人提到:

Heaps are also very useful in big disk sorts. You most probably all know that a big sort implies producing “runs” (which are pre-sorted sequences, whose size is usually related to the amount of CPU memory), followed by a merging passes for these runs, which merging is often very cleverly organised.
It is very important that the initial sort produces the longest runs possible. Tournaments are a good way to achieve that. If, using all the memory available to hold a tournament, you replace and percolate items that happen to fit the current run, you’ll produce runs which are twice the size of the memory for random input, and much better for input fuzzily ordered.

Moreover, if you output the 0’th item on disk and get an input which may not fit in the current tournament (because the value “wins” over the last output value), it cannot fit in the heap, so the size of the heap decreases. The freed memory could be cleverly reused immediately for progressively building a second heap, which grows at exactly the same rate the first heap is melting.
When the first heap completely vanishes, you switch heaps and start a new run. Clever and quite effective!

我知道一个叫做 External sorting 的算法我们:

  1. 将输入分解成更小的 block 。
  2. 分别对所有 block 进行排序,然后将它们一个接一个地写回磁盘。
  3. 创建一个堆并在所有排序的 block 之间进行 k 路合并。

我完全理解维基百科上描述的外部排序,但我无法理解作者所说的:

If, using all the memory available to hold a tournament, you replace and percolate items that happen to fit the current run, you’ll produce runs which are twice the size of the memory for random input, and much better for input fuzzily ordered.

和:

Moreover, if you output the 0’th item on disk and get an input which may not fit in the current tournament (because the value “wins” over the last output value), it cannot fit in the heap, so the size of the heap decreases.

这是什么堆熔化

最佳答案

堆熔不是一回事。 heap getting smaller 就是作者用的词,就是拉出最小的元素。

他所说的想法是对外部排序中“将输入分成 block 并对 block 进行排序”部分的巧妙替代。它产生更大的排序 block 。

这个想法是,首先将最大的 block 读入内存并将其排列到堆中,然后在读入新元素时开始从堆中写出最小的元素。

当你读入一个小于你已经写出的元素的元素时,它不能进入​​当前 block (它会破坏排序),所以你会记住下一个 block 。可以将不小于您最后写出的元素的元素插入堆中。它们将进入当前 block ,使当前 block 变大。

最终你的堆将是空的。到那时你就完成了当前的 block ——堆化你记住的所有元素并开始写出下一个 block 。

关于python - 使用堆进行大磁盘排序,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/56410068/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com