# Why is prime used in the polynomial hash

## Why use a prime number in hashCode?

Because you want the number you multiply by and the number of buckets you insert into to be orthogonal prime factors.

Suppose there are 8 buckets to insert. If the number you are multiplying by is a multiple of 8, only the least significant entry (the one that is not multiplied at all) determines the inserted bucket. Similar entries collide. Not good for a hash function.

31 is a prime number large enough that the number of buckets is unlikely to be divisible by it (and in fact, modern Java HashMap implementations keep the number of buckets to a power of 2).

Prime numbers are chosen to best distribute data among hash buckets. If the distribution of the inputs is random and evenly distributed, the choice of hash code / module does not matter. This only has an effect if the inputs have a certain pattern.

This is often the case when dealing with storage space. For example, all 32-bit integers are aligned with addresses that are divisible by 4. The following table shows the effects of using a prim or non-prim module:

Note the near perfect distribution when using a prime versus a non-prime module.

Although the above example is largely made up, the general principle is that when treating a Pattern of inputs the Using a prime module gives the best distribution.

For what it's worth, waived Effective Java 2nd Edition by Hand on the math problem and just saying that the reason for choosing 31 is:

• Because it's an odd prime, and it's "traditional" to use prime numbers
• It is also one less than a power of two, which allows for bit-wise optimization

Here's the full quote Point 9: Always overwrite when you overwrite :

The value 31 was chosen because it is an odd prime number. If it were straight and the multiplication overflows, information would be lost since the multiplication by 2 equates to a shift. The benefit of using a prime number is less clear, but traditional.

A nice feature of 31 is that multiplication can be replaced with a shift (§15.19) and subtraction for better performance:

Modern VMs do this type of optimization automatically.

While the recipe in this article provides reasonably good hash functions, it does not provide state-of-the-art hash functions, nor do Java platform libraries provide such hash functions as of version 1.6. Writing such hash functions is a research topic that is best left to mathematicians and theoretical computer scientists.

A later version of the platform may provide hash functions for its classes and state-of-the-art utility methods so that average programmers can create such hash functions. In the meantime, the techniques outlined in this article should work for most applications.

Rather simplified, it can be said that using a multiplier with numerous divisors leads to more hash collisions. Since we want to minimize the number of collisions for effective hashing, we try to use a multiplier with fewer factors. By definition, a prime number has exactly two different positive factors.

### Related questions

Heard 31 was chosen to allow the compiler to optimize the multiplication to shift 5 bits to the left and then subtract the value.

Here is a quote a little closer to the source.

It boils down to:

• 31 is a prime number that reduces collisions
• 31 gives a good distribution with
• a reasonable compromise in speed

First you compute the hash value modulo 2 ^ 32 (the size of an), so you want something relatively primes to 2 ^ 32 (relatively prim means there are no common factors). Any odd number would be sufficient for this.

Then, for a given hash table, the index is usually calculated from the hash value modulo the size of the hash table, so you want something that is relatively prime to the size of the hash table. For this reason, the sizes of hash tables are often chosen as prime numbers. In the case of Java, the Sun implementation ensures that the size is always a power of two, so that an odd number would also suffice here. There are also some additional massages of the hash keys to further limit collisions.

The bad effect if the hash table and the multiplier had a common factor could be that under certain circumstances only 1 / n entries are used in the hash table.

The reason prime numbers are used is to minimize collisions when the data has certain patterns.

First things first, if the data is random, no prime number is required. You can do a mod operation on any number and have the same number of collisions for every possible value of the module.

But when dates aren't random, strange things happen. For example, consider numeric data, which is always a multiple of 10.

When we use Mod 4 we find:

10 mod 4 = 2

20 mod 4 = 0

30 mod 4 = 2

40 mod 4 = 0

50 mod 4 = 2

Of the 3 possible values ​​of the module (0,1,2,3) only 0 and 2 have collisions, which is bad.

If we use a prime like 7:

10 mod 7 = 3

20 mod 7 = 6

30 mod 7 = 2

40 mod 7 = 4

50 mod 7 = 1

Etc

We also find that 5 is not a good choice, but 5 is a prime number. The reason for this is that all of our keys are multiples of 5. This means that we have to choose a prime number that does not share our keys. Choosing a large prime number is usually enough.

The reason prime numbers are used is to neutralize the effect of patterns in the keys in distributing collisions of a hash function.

31 is also specific to Java HashMap, which uses an int as the hash data type. So the maximum capacity is 2 ^ 32. It makes no sense to use larger Fermat or Mersenne prime numbers.

This generally helps in achieving a more even distribution of your data across the hash buckets, especially with low entropy keys.

We use cookies and other tracking technologies to improve your browsing experience on our website, to show you personalized content and targeted ads, to analyze our website traffic, and to understand where our visitors are coming from.

By continuing, you consent to our use of cookies and other tracking technologies and affirm you're at least 16 years old or have consent from a parent or guardian.