How are hash functions like MD5 unique?

HashTheoryMd5Uniqueidentifier

Hash Problem Overview


I'm aware that MD5 has had some collisions but this is more of a high-level question about hashing functions.

If MD5 hashes any arbitrary string into a 32-digit hex value, then according to the [Pigeonhole Principle][1] surely this can not be unique, as there are more unique arbitrary strings than there are unique 32-digit hex values.

[1]: https://en.wikipedia.org/wiki/Pigeonhole_principle "Pigeonhole Principle"

Hash Solutions


Solution 1 - Hash

You're correct that it cannot guarantee uniqueness, however there are approximately 3.402823669209387e+38 different values in a 32 digit hex value (16^32). That means that, assuming the math behind the algorithm gives a good distribution, your odds are phenomenally small that there will be a duplicate. You do have to keep in mind that it IS possible to duplicate when you're thinking about how it will be used. MD5 is generally used to determine if something has been changed (I.e. it's a checksum). It would be ridiculously unlikely that something could be modified and result in the same MD5 checksum.

Edit: (given recent news re: SHA1 hashes) The answer above, still holds, but you shouldn't expect an MD5 hash to serve as any kind of security check against manipulation. SHA-1 Hashes as 2^32 (over 4 billion) times less likely to collide, and it has been demonstrated that it is possible to contrive an input to produce the same value. (This was demonstrated against MD5 quite some time ago). If you're looking to ensure nobody has maliciously modified something to produce the same hash value, these days, you need at SHA-2 to have a solid guarantee.

On the other hand, if it's not in a security check context, MD5 still has it's usefulness.

The argument could be made that an SHA-2 hash is cheap enough to compute, that you should just use it anyway.

Solution 2 - Hash

You are absolutely correct. But hashes are not about "unique", they are about "unique enough".

Solution 3 - Hash

As others have pointed out, the goal of a hash function like MD5 is to provide a way of easily checking whether two objects are equivalent, without knowing what they originally were (passwords) or comparing them in their entirety (big files).

Say you have an object O and its hash hO. You obtain another object P and wish to check whether it is equal to O. This could be a password, or a file you downloaded (in which case you won't have O but rather the hash of it hO that came with P, most likely). First, you hash P to get hP.

There are now 2 possibilities:

  1. hO and hP are different. This must mean that O and P are different, because using the same hash on 2 values/objects must yield the same value. Hashes are deterministic. There are no false negatives.

  2. hO and hP are equal. As you stated, because of the Pigeonhole Principle this could mean that different objects hashed to the same value, and further action may need to be taken.

    a. Because the number of possibilities is so high, if you have faith in your hash function it may be enough to say "Well there was a 1 in 2128 chance of collision (ideal case), so we can assume O = P. This may work for passwords if you restrict the length and complexity of characters, for example. It is why you see hashes of passwords stored in databases rather than the passwords themselves. b. You may decide that just because the hash came out equal doesn't mean the objects are equal, and do a direct comparison of O and P. You may have a false positive.

So while you may have false positive matches, you won't have false negatives. Depending on your application, and whether you expect the objects to always be equal or always be different, hashing may be a superfluous step.

Solution 4 - Hash

Cryptographic one-way hash functions are, by nature of definition, not Injective. In terms of hash functions, "unique" is pretty meaningless. These functions are measured by other attributes, which affects their strength by making it hard to create a pre-image of a given hash. For example, we may care about how many image bits are affected by changing a single bit in the pre-image. We may care about how hard it is to conduct a brute force attack (finding a prie-image for a given hash image). We may care about how hard it is to find a collision: finding two pre-images that have the same hash image, to be used in a birthday attack.

Solution 5 - Hash

While it is likely that you get collisions if the values to be hashed are much longer than the resulting hash, the number of collisions is still sufficiently low for most purposes (there are 2128 possible hashes total so the chance of two random strings producing the same hash is theoretically close to 1 in 1038).

MD5 was primarily created to do integrity checks, so it is very sensitive to minimal changes. A minor modification in the input will result in a drastically different output. This is why it is hard to guess a password based on the hash value alone.

While the hash itself is not reversible, it is still possible to find a possible input value by pure brute force. This is why you should always make sure to add a salt if you are using MD5 to store password hashes: if you include a salt in the input string, a matching input string has to include exactly the same salt in order to result in the same output string because otherwise the raw input string that matches the output will fail to match after the automated salting (i.e. you can't just "reverse" the MD5 and use it to log in because the reversed MD5 hash will most likely not be the salted string that originally resulted in the creation of the hash).

So hashes are not unique, but the authentication mechanism can be made to make it sufficiently unique (which is one somewhat plausible argument for password restrictions in lieu of salting: the set of strings that results in the same hash will probably contain many strings that do not obey the password restrictions, so it's more difficult to reverse the hash by brute force -- obviously salts are still a good idea nevertheless).

Bigger hashes mean a larger set of possible hashes for the same input set, so a lower chance of overlap, but until processing power advances sufficiently to make brute-forcing MD5 trivial, it's still a decent choice for most purposes.

Solution 6 - Hash

(It seems to be Hash Function Sunday.)

Cryptographic hash functions are designed to have very, very, very, low duplication rates. For the obvious reason you state, the rate can never be zero.

The Wikipedia page is informative.

Solution 7 - Hash

As Mike (and basically every one else) said, its not perfect, but it does the job, and collision performance really depends on the algo (which is actually pretty good).

What is of real interest is automatic manipulation of files or data to keep the same hash with different data, see this Demo

Solution 8 - Hash

As others have answered, hash functions are by definition not guaranteed to return unique values, since there are a fixed number of hashes for an infinite number of inputs. Their key quality is that their collisions are unpredictable.

In other words, they're not easily reversible -- so while there may be many distinct inputs that will produce the same hash result (a "collision"), finding any two of them is computationally infeasible.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionAlyView Question on Stackoverflow
Solution 1 - HashMike CargalView Answer on Stackoverflow
Solution 2 - HashIgnacio Vazquez-AbramsView Answer on Stackoverflow
Solution 3 - HashPhilView Answer on Stackoverflow
Solution 4 - HashM.A. HaninView Answer on Stackoverflow
Solution 5 - HashAlan PlumView Answer on Stackoverflow
Solution 6 - HashbmarguliesView Answer on Stackoverflow
Solution 7 - HashBolsterView Answer on Stackoverflow
Solution 8 - HashPinkoView Answer on Stackoverflow