Why use a prime number in hashCode?

Java, Hashcode, Primes

Java Problem Overview


I was just wondering why primes are used in a class's hashCode() method. For example, when I use Eclipse to generate my hashCode() method, the prime number 31 is always used:

public int hashCode() {
     final int prime = 31;
     //...
}
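
For context, the elided body of such a generated method usually follows a multiply-and-add pattern. The following is only a sketch of that pattern: the class `Point` and its fields `x` and `y` are hypothetical, and the exact text Eclipse emits may differ.

```java
// Hypothetical value class used only to illustrate the pattern.
class Point {
    private final int x;
    private final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        // Fold each field into the running result, multiplying by the prime first.
        result = prime * result + x;
        result = prime * result + y;
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (!(obj instanceof Point)) return false;
        Point other = (Point) obj;
        return x == other.x && y == other.y;
    }
}
```

Because each field is multiplied by a different power of 31, swapping the field values (for example `new Point(1, 2)` versus `new Point(2, 1)`) generally gives a different hash code.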

References:

Here is a good primer on hash codes and an article on how hashing works that I found (it is about C#, but the concepts are transferable): Eric Lippert's Guidelines and rules for GetHashCode()

Java Solutions


Solution 1 - Java

Prime numbers are chosen to best distribute data among hash buckets. If the distribution of inputs is random and evenly spread, then the choice of the hash code/modulus does not matter. It only has an impact when there is a certain pattern to the inputs.

This is often the case when dealing with memory locations. For example, all 32-bit integers are aligned to addresses divisible by 4. Check out the table below to visualize the effects of using a prime vs. non-prime modulus:

Input       Modulo 8    Modulo 7
0           0           0
4           4           4
8           0           1
12          4           5
16          0           2
20          4           6
24          0           3
28          4           0

Notice the almost-perfect distribution with the prime modulus (7): all seven buckets are used, whereas the non-prime modulus (8) only ever hits buckets 0 and 4.

Although the above example is largely contrived, the general principle is that when the inputs follow a pattern, a prime modulus tends to yield a more even distribution.
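
The table above can be reproduced with a few lines of Java. This is a minimal sketch; the class name `ModulusDemo` and the method `bucketsUsed` are made up for illustration.

```java
import java.util.Set;
import java.util.TreeSet;

class ModulusDemo {
    // Returns the distinct bucket indices hit by the inputs 0, 4, 8, ..., 28.
    static Set<Integer> bucketsUsed(int modulus) {
        Set<Integer> used = new TreeSet<>();
        for (int input = 0; input <= 28; input += 4) {
            used.add(input % modulus);
        }
        return used;
    }

    public static void main(String[] args) {
        System.out.println("mod 8: " + bucketsUsed(8)); // [0, 4]
        System.out.println("mod 7: " + bucketsUsed(7)); // [0, 1, 2, 3, 4, 5, 6]
    }
}
```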

Solution 2 - Java

Because you want the number you are multiplying by and the number of buckets you are inserting into to have no prime factors in common (that is, to be coprime).

Suppose there are 8 buckets to insert into. If the number you are using to multiply by is some multiple of 8, then the bucket inserted into will only be determined by the least significant entry (the one not multiplied at all). Similar entries will collide. Not good for a hash function.

31 is a large enough prime that the number of buckets is unlikely to be divisible by it (and in fact, modern Java HashMap implementations keep the number of buckets at a power of 2).
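
To make the 8-bucket case concrete, here is a small sketch. It assumes a two-field key hashed as `multiplier * first + second`; the class `MultiplierDemo` and the method `bucket` are hypothetical names.

```java
class MultiplierDemo {
    // Bucket index for a two-field key, hashed as multiplier * first + second.
    static int bucket(int multiplier, int first, int second, int buckets) {
        int hash = multiplier * first + second;
        return Math.floorMod(hash, buckets);
    }

    public static void main(String[] args) {
        for (int first = 0; first < 5; first++) {
            // With multiplier 8 and 8 buckets, 'first' has no effect on the bucket:
            // the result is always second mod 8, here 3.
            int withEight = bucket(8, first, 3, 8);
            // With multiplier 31, 'first' changes the bucket.
            int withThirtyOne = bucket(31, first, 3, 8);
            System.out.println(first + " -> " + withEight + " vs " + withThirtyOne);
        }
    }
}
```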

Solution 3 - Java

For what it's worth, Effective Java 2nd Edition hand-waves around the mathematics issue and just says that the reason to choose 31 is:

  • Because it's an odd prime, and it's "traditional" to use primes
  • It's also one less than a power of two, which permits bitwise optimization

Here's the full quote, from Item 9: Always override hashCode when you override equals:

> The value 31 was chosen because it's an odd prime. If it were even and multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional.
>
> A nice property of 31 is that the multiplication can be replaced by a shift (§15.19) and subtraction for better performance:
>
>     31 * i == (i << 5) - i
>
> Modern VMs do this sort of optimization automatically.
>
> ---
>
> While the recipe in this item yields reasonably good hash functions, it does not yield state-of-the-art hash functions, nor do Java platform libraries provide such hash functions as of release 1.6. Writing such hash functions is a research topic, best left to mathematicians and theoretical computer scientists.
>
> Perhaps a later release of the platform will provide state-of-the-art hash functions for its classes and utility methods to allow average programmers to construct such hash functions. In the meantime, the techniques described in this item should be adequate for most applications.

Rather simplistically, it can be said that using a multiplier with numerous divisors will result in more hash collisions. Since for effective hashing we want to minimize the number of collisions, we try to use a multiplier that has fewer divisors. A prime number by definition has exactly two distinct, positive divisors.

Solution 4 - Java

I heard that 31 was chosen so that the compiler can optimize the multiplication into a left shift by 5 bits followed by a subtraction of the original value.
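
The shift-and-subtract identity is easy to check, including for values where the multiplication overflows, since Java int arithmetic wraps around modulo 2^32. A minimal sketch follows; `ShiftDemo` and its methods are illustrative names only.

```java
class ShiftDemo {
    static int viaMultiply(int i) {
        return 31 * i;
    }

    static int viaShift(int i) {
        // i << 5 is 32 * i, so subtracting i once leaves 31 * i.
        return (i << 5) - i;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 12345, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int i : samples) {
            System.out.println(i + ": " + (viaMultiply(i) == viaShift(i)));
        }
    }
}
```

Whether a given JIT actually performs this rewrite is an implementation detail; the sketch only shows that the two expressions agree.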

Solution 5 - Java

Here's a citation a little closer to the source.

It boils down to:

  • 31 is prime, which reduces collisions
  • 31 produces a good distribution, with
  • a reasonable tradeoff in speed

Solution 6 - Java

First you compute the hash value modulo 2^32 (the size of an int), so you want something relatively prime to 2^32 (relatively prime means that there are no common divisors other than 1). Any odd number would do for that.
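
A short sketch of why an odd multiplier matters modulo 2^32: with the even multiplier 32, the contribution of the first character is multiplied by 32^7 = 2^35 in an 8-character string, which wraps to zero, so that character is lost entirely. The class `OddMultiplierDemo` and the method `hash` are hypothetical names; the loop is the usual multiply-and-add polynomial hash.

```java
class OddMultiplierDemo {
    // Polynomial hash: h = multiplier * h + nextChar, with int overflow wrapping.
    static int hash(String s, int multiplier) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = multiplier * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String a = "abcdefgh";
        String b = "xbcdefgh"; // differs from a only in the first character

        // Even multiplier: the first character is shifted out completely.
        System.out.println(hash(a, 32) == hash(b, 32)); // true (collision)

        // Odd multiplier: the first character still influences the result.
        System.out.println(hash(a, 31) == hash(b, 31)); // false
    }
}
```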

Then, for a given hash table, the index is usually computed from the hash value modulo the size of the hash table, so you want something that is relatively prime to the size of the hash table. Often the sizes of hash tables are chosen as prime numbers for that reason. In the case of Java, the Sun implementation makes sure that the size is always a power of two, so an odd number would suffice here, too. There is also some additional massaging of the hash keys to limit collisions further.

The bad effect if the size of the hash table and the multiplier had a common factor n could be that, in certain circumstances, only 1/n of the entries in the hash table would be used.
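
As an illustration of the "additional massaging" mentioned above, the sketch below mirrors, to the best of my understanding, the bit-spreading step used by `java.util.HashMap` in OpenJDK 8 and later, where the high 16 bits are XORed into the low 16 bits before masking with the table size. The class `SpreadDemo` and its method names are illustrative, and null-key handling is left out.

```java
class SpreadDemo {
    // XOR the upper half of the hash into the lower half (assumed to match OpenJDK 8+).
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // Index into a table whose size is a power of two.
    static int index(int hash, int tableSize) {
        return hash & (tableSize - 1);
    }

    public static void main(String[] args) {
        // Hash codes that differ only in their upper bits.
        int[] hashes = {0x10000, 0x20000, 0x30000};
        for (int h : hashes) {
            int plain = index(h, 16);            // always 0: the low bits are identical
            int spreadOut = index(spread(h), 16); // 1, 2, 3: the high bits now matter
            System.out.println(Integer.toHexString(h) + " -> " + plain + " vs " + spreadOut);
        }
    }
}
```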

Solution 7 - Java

The reason why prime numbers are used is to minimize collisions when the data exhibits some particular patterns.

First things first: if the data is random, then there’s no need for a prime number. You can do a mod operation against any number and you will have the same number of collisions for each possible value of the modulus.

But when the data is not random, strange things happen. For example, consider numeric data that is always a multiple of 10.

If we use mod 4 we find:

10 mod 4 = 2

20 mod 4 = 0

30 mod 4 = 2

40 mod 4 = 0

50 mod 4 = 2

So out of the 4 possible values of the modulus (0, 1, 2, 3), only 0 and 2 are ever used, and all the collisions pile up there. That is bad.

If we use a prime number like 7:

10 mod 7 = 3

20 mod 7 = 6

30 mod 7 = 2

40 mod 7 = 5

50 mod 7 = 1

etc

We also note that 5 is not a good choice even though it is prime, because all our keys are multiples of 5. This means we have to choose a prime number that doesn’t divide our keys; choosing a large prime number is usually enough.

So, erring on the side of being repetitive, the reason prime numbers are used is to neutralize the effect of patterns in the keys on the distribution of collisions of a hash function.
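
The multiples-of-10 example can be checked with a short sketch. `PatternDemo` and `remainders` are made-up names, and the key range 10 to 100 is an arbitrary choice for the demonstration.

```java
import java.util.Set;
import java.util.TreeSet;

class PatternDemo {
    // Distinct remainders produced by the keys 10, 20, ..., 100 for a given modulus.
    static Set<Integer> remainders(int modulus) {
        Set<Integer> seen = new TreeSet<>();
        for (int key = 10; key <= 100; key += 10) {
            seen.add(key % modulus);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("mod 4: " + remainders(4)); // [0, 2]
        System.out.println("mod 5: " + remainders(5)); // [0]
        System.out.println("mod 7: " + remainders(7)); // [0, 1, 2, 3, 4, 5, 6]
    }
}
```

The modulus 5 is the worst case here even though it is prime, because it divides every key.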

Solution 8 - Java

31 is also specific to Java's HashMap, which uses an int as the hash data type, so there are at most 2^32 distinct hash values. There is no point in using larger Fermat or Mersenne primes.

Solution 9 - Java

It generally helps achieve a more even spread of your data among the hash buckets, especially for low-entropy keys.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type        Original Author      Original Content on Stackoverflow
Question            Ian Dallas           View Question on Stackoverflow
Solution 1 - Java   advait               View Answer on Stackoverflow
Solution 2 - Java   ILMTitan             View Answer on Stackoverflow
Solution 3 - Java   polygenelubricants   View Answer on Stackoverflow
Solution 4 - Java   Steve Kuo            View Answer on Stackoverflow
Solution 5 - Java   John                 View Answer on Stackoverflow
Solution 6 - Java   starblue             View Answer on Stackoverflow
Solution 7 - Java   Amar Magar           View Answer on Stackoverflow
Solution 8 - Java   DED                  View Answer on Stackoverflow
Solution 9 - Java   user240438           View Answer on Stackoverflow