Is it safe to ignore the possibility of SHA collisions in practice?

HashSha

Hash Problem Overview


Let's say we have a billion unique images, one megabyte each. We calculate the SHA-256 hash for the contents of each file. The possibility of collision depends on:

  • the number of files
  • the size of the single file

How far can we go ignoring this possibility, assuming it is zero?

Hash Solutions


Solution 1 - Hash

The usual answer goes thus: what is the probability that a rogue asteroid crashes on Earth within the next second, obliterating civilization-as-we-know-it, and killing off a few billion people? It can be argued that any unlucky event with a probability lower than that is not actually very important.

If we have a "perfect" hash function with output size n, and we have p messages to hash (individual message length is not important), then probability of collision is about p2/2n+1 (this is an approximation which is valid for "small" p, i.e. substantially smaller than 2n/2). For instance, with SHA-256 (n=256) and one billion messages (p=109) then the probability is about 4.3*10-60.

A mass-murderer space rock happens about once every 30 million years on average. This leads to a probability of such an event occurring in the next second to about 10-15. That's 45 orders of magnitude more probable than the SHA-256 collision. Briefly stated, if you find SHA-256 collisions scary then your priorities are wrong.

In a security setup, where an attacker gets to choose the messages which will be hashed, then the attacker may use substantially more than a billion messages; however, you will find that the attacker's success probability will still be vanishingly small. That's the whole point of using a hash function with a 256-bit output: so that risks of collision can be neglected.

Of course, all of the above assumes that SHA-256 is a "perfect" hash function, which is far from being proven. Still, SHA-256 seems quite robust.

Solution 2 - Hash

The possibility of a collision does not depend on the size of the files, only on their number.

This is an example of the birthday paradox. The Wikipedia page gives an estimate of the likelihood of a collision. If you run the numbers, you'll see that all harddisks ever produced on Earth can't hold enough 1MB files to get a likelihood of a collision of even 0.01% for SHA-256.

Basically, you can simply ignore the possibility.

Solution 3 - Hash

First of all, it is not zero, but very close to zero.

The key question is what happens if a collision actually occurs? If the answer is "a nuclear power plant will explode" then you likely shouldn't ignore the collision possibility. In most cases the consequences are not that dire and so you can ignore the collision possibility.

Also don't forget that you software (or a tiny part of it) might be deployed and simultaneously used in a gazillion of computers (some tiny embedded microcomputers that are almost everywhere nowadays included). In such case you need to multiply the estimate you've got by the largest possible number of copies.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionHristo HristovView Question on Stackoverflow
Solution 1 - HashThomas PorninView Answer on Stackoverflow
Solution 2 - HashMichael BorgwardtView Answer on Stackoverflow
Solution 3 - HashsharptoothView Answer on Stackoverflow