
java - Why do hash tables resize by doubling?

Reposted · Author: 搜寻专家 · Updated: 2023-10-30 21:29:48

Examining Java and searching online for hash table code samples, tables seem to be resized by doubling.
But most textbooks say that the best size for the table is a prime number.
So my question is:
Is the approach of doubling because:

  1. It is easy to implement, or
  2. Is finding a prime number too inefficient (but I think that finding the next prime by going over n += 2 and testing for primality using modulo is O(log log N), which is cheap)
  3. Or is this my misunderstanding, and only certain hash table variants require a prime table size?

Update:
The prime sizes presented in textbooks are required for certain properties to work (e.g., quadratic probing needs a prime-sized table to prove that, for example, item X will be inserted as long as the table is not full).
The question linked as a duplicate asks generally about growing by any amount, e.g. 25% or to the next prime, and its accepted answer says we double to keep resize operations "rare" so that we can guarantee amortized time.
That does not answer the question of keeping a prime table size and resizing to a prime, possibly even more than double the old size. So the idea is to preserve the prime-size property while taking the resize overhead into account.
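The prime-size property mentioned above can be demonstrated with a small sketch (illustrative class and method names, not from any library): with quadratic probing, the probe sequence `(h + i*i) % size` reaches at least (p + 1) / 2 distinct slots when the size p is prime, which is what guarantees an insertion succeeds while the table is less than half full; with a power-of-2 size the sequence can collapse onto far fewer slots.

```java
import java.util.HashSet;
import java.util.Set;

public class QuadraticProbeDemo {
    // Count how many distinct slots the quadratic probe sequence
    // (i*i) % size can reach from home slot 0.
    static int distinctSlots(int size) {
        Set<Integer> slots = new HashSet<>();
        for (int i = 0; i < size; i++) {
            slots.add((i * i) % size);
        }
        return slots.size();
    }

    public static void main(String[] args) {
        // Power-of-2 size 8: probes cycle through only {0, 1, 4}.
        System.out.println(distinctSlots(8)); // 3
        // Prime size 7: probes reach (7 + 1) / 2 = 4 distinct slots.
        System.out.println(distinctSlots(7)); // 4
    }
}
```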

Best Answer

Q: But most textbooks say that the best size for the table is a prime number.

Regarding size primality:

As for primality of the size, it depends on the collision-resolution algorithm you choose. Some algorithms require a prime table size (double hashing, quadratic probing); others don't, and can benefit from a power-of-2 table size, because it allows very cheap modulo operations. However, when the closest "available table sizes" differ by a factor of 2, the memory usage of the hash table may be unpredictable. So, even when using linear probing or separate chaining, you can choose a non-power-of-2 size. In that case, in turn, it's worth choosing a prime size in particular, because:
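The "very cheap modulo" point can be sketched as follows (hypothetical helper, assuming the capacity is a power of 2): for non-negative hashes, `hash % 2^k` equals `hash & (2^k - 1)`, a single bitmask instead of an integer division.

```java
public class Pow2Index {
    // Map a hash to a table slot, assuming capacity is a power of 2.
    // The bitmask is equivalent to hash % capacity for non-negative hashes,
    // but avoids the integer-division instruction entirely.
    static int indexFor(int hash, int capacity) {
        return hash & (capacity - 1);
    }

    public static void main(String[] args) {
        System.out.println(indexFor(37, 16)); // 5, same as 37 % 16
    }
}
```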

If you pick a prime table size (either because the algorithm requires it, or because you are not satisfied with the memory-usage unpredictability implied by a power-of-2 size), the table-slot computation (modulo by the table size) can be combined with the hashing. See this answer for more.

The point that a power-of-2 table size is undesirable when the hash function's distribution is bad (from the answer by Neil Coffey) is impractical, because even if you have a bad hash function, avalanching it and still using a power-of-2 size would be faster than switching to a prime table size, since a single integer division is still slower on modern CPUs than the several multiplications and shift operations required by good avalanching functions, e.g. from MurmurHash3.
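The avalanching step referred to here can be sketched with the 32-bit finalizer from MurmurHash3 (commonly called fmix32), which scrambles a poorly distributed hash using only XOR-shifts and multiplications, with no division:

```java
public class Avalanche {
    // MurmurHash3 32-bit finalizer: a bijective mixing step that
    // spreads input bits across the whole word, so even sequential
    // keys land in unrelated slots of a power-of-2 table.
    static int fmix32(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    public static void main(String[] args) {
        // Neighbouring inputs produce very different outputs.
        System.out.println(Integer.toHexString(fmix32(1)));
        System.out.println(Integer.toHexString(fmix32(2)));
    }
}
```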


Q: Also to be honest I got lost a bit on if you actually recommend primes or not. Seems that it depends on the hash table variant and the quality of the hash function?

  1. The quality of the hash function doesn't matter: you can always "improve" a hash function with MurmurHash3 avalanching, which is cheaper than switching from a power-of-2 table size to a prime table size, see above.

  2. I recommend choosing a prime size, with a QHash or quadratic-hashing algorithm (they aren't the same), when you need precise control over the hash table's load factor and predictably high actual loads. With a power-of-2 table size the minimum resize factor is 2, so generally we cannot guarantee the actual load factor will be higher than 0.5. See this answer.

    Otherwise, I recommend a power-of-2-sized hash table with linear probing.

Q: Is the approach of doubling because:
It is easy to implement, or

Basically, in many cases, yes. See this large answer regarding load factors:

Load factor is not an essential part of the hash table data structure -- it is a way to define the rules of behaviour for the dynamic system (a growing/shrinking hash table is a dynamic system).

Moreover, in my opinion, in 95% of modern hash table use cases this approach is oversimplified, and the dynamic system behaves suboptimally.

What is doubling? It is just the simplest resize strategy. The strategy could be arbitrarily sophisticated, performing optimally for your use case. It could consider the current hash table size, growth intensity (how many get operations have been done since the previous resize), and so on. Nobody forbids you from implementing such custom resizing logic.
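The simplest strategy can be sketched in a few lines (illustrative names, assuming a maximum load-factor threshold): double the capacity as soon as the next insertion would push the load factor past the threshold.

```java
public class DoublingPolicy {
    // Decide the capacity after inserting one more entry:
    // double when the table would exceed the load-factor threshold,
    // otherwise keep the current capacity.
    static int nextCapacity(int size, int capacity, double maxLoad) {
        return (size + 1 > capacity * maxLoad) ? capacity * 2 : capacity;
    }

    public static void main(String[] args) {
        // At 12/16 entries with maxLoad 0.75, one more insert triggers doubling.
        System.out.println(nextCapacity(12, 16, 0.75)); // 32
        // At 8/16 there is still headroom.
        System.out.println(nextCapacity(8, 16, 0.75));  // 16
    }
}
```

A custom policy would replace the single comparison with whatever signal matters for the workload (growth rate, shrink hysteresis, etc.).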

Q: Is finding a prime number too inefficient (but I think that finding the next prime going over n+=2 and testing for primality using modulo is O(loglogN) which is cheap)

A good practice is to precompute some subset of prime hash-table sizes, to choose between them at runtime using binary search. See the list of double-hash capacities and the explanation, and the QHash capacities. Alternatively, even use direct lookup, which is very fast.
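The precompute-and-binary-search idea can be sketched like this (the prime list below is illustrative, roughly doubling, and not taken from any library):

```java
import java.util.Arrays;

public class PrimeCapacities {
    // Precomputed prime capacities, each roughly double the previous.
    static final int[] PRIMES = {11, 23, 47, 97, 197, 397, 797, 1597, 3203, 6421};

    // Return the smallest precomputed prime >= needed,
    // falling back to the requested size past the end of the table.
    static int capacityFor(int needed) {
        int i = Arrays.binarySearch(PRIMES, needed);
        if (i < 0) i = -i - 1; // insertion point = first prime >= needed
        return i < PRIMES.length ? PRIMES[i] : needed;
    }

    public static void main(String[] args) {
        System.out.println(capacityFor(100)); // 197
        System.out.println(capacityFor(797)); // exact hit: 797
    }
}
```

This replaces a runtime primality scan with one O(log n) lookup over a tiny table, which is why the answer calls it good practice.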

Q: Or this is my misunderstanding and only certain hashtable variants only require prime table size?

Yes, only certain variants require it, see above.

Regarding "java - Why do hash tables resize by doubling?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/30382783/
