How many random elements before MD5 produces collisions?


Random Problem Overview


I've got an image library on Amazon S3. For each image, I md5 the source URL on my server plus a timestamp to get a unique filename. Since S3 can't have subdirectories, I need to store all of these images in a single flat folder.

Do I need to worry about collisions in the MD5 hash value that gets produced?

Bonus: How many files could I have before I'd start seeing collisions in the hash value that MD5 produces?

Random Solutions


Solution 1 - Random

The probability of just two hashes accidentally colliding is 1/2^128, which is 1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456.

However, if you keep all the hashes, then the probability is a bit higher thanks to the birthday paradox. To have a 50% chance of any hash colliding with any other hash you need roughly 2^64 hashes. This means that to get a collision, on average, you'll need to hash 6 billion files per second for 100 years.
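The arithmetic behind those figures can be checked with a short sketch. It assumes MD5 output behaves like uniformly random 128-bit values and uses the standard birthday approximation; the function and variable names are illustrative only.

```python
import math

HASH_BITS = 128  # MD5 digest size in bits


def hashes_for_collision_probability(p, bits=HASH_BITS):
    """Approximate count of uniformly random hashes needed before a
    collision occurs with probability p (birthday approximation)."""
    return math.sqrt(2 * (2 ** bits) * math.log(1 / (1 - p)))


# Number of hashes for a 50% chance of at least one collision.
n_half = hashes_for_collision_probability(0.5)

# Hashing rate needed to produce 2^64 hashes in 100 years.
seconds_per_century = 100 * 365.25 * 24 * 3600
rate = 2 ** 64 / seconds_per_century

print(f"hashes for 50% collision chance: {n_half:.3e}")
print(f"hashes per second over 100 years: {rate:.3e}")
```

Under these assumptions the 50% point comes out slightly above 2^64 (about 1.18 times it), and the required rate is a little under 6 billion hashes per second.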

Solution 2 - Random

S3 can have subdirectories. Just put a "/" in the key name, and you can access the files as if they were in separate directories. I use this to store user files in separate folders based on their user ID in S3.

For example: "mybucket/users/1234/somefile.jpg". It's not exactly the same as a directory in a file system, but the S3 API has some features that let it work almost the same. I can ask it to list all files that begin with "users/1234/" and it will show me all the files in that "directory".
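As a rough illustration of that idea, the sketch below treats keys as flat strings and filters them by prefix, which is what makes them look like directories. It runs locally and does not call S3; `user_key` and `list_with_prefix` are hypothetical helpers, not part of any S3 client library.

```python
def user_key(user_id, filename):
    """Build a flat key that merely looks like a path."""
    return f"users/{user_id}/{filename}"


def list_with_prefix(keys, prefix):
    """Return the keys that start with the given prefix, sorted."""
    return sorted(k for k in keys if k.startswith(prefix))


# A flat collection of keys, standing in for the contents of one bucket.
bucket_keys = [
    user_key(1234, "somefile.jpg"),
    user_key(1234, "other.jpg"),
    user_key(5678, "somefile.jpg"),
]

print(list_with_prefix(bucket_keys, "users/1234/"))
```

Every key still lives in one flat namespace; the "/" is just another character that a prefix query can match on.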

Solution 3 - Random

So wait, is it:

md5(filename) + timestamp

or:

md5(filename + timestamp)

If the former, you are most of the way to a GUID, and I wouldn't worry about it. If the latter, then see Karg's post about how you will run into collisions eventually.
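The two readings can be written out with Python's `hashlib` to make the difference concrete. The function names and the "-" separator are assumptions for illustration, not the asker's actual code.

```python
import hashlib


def name_hash_then_timestamp(source_url, timestamp):
    """md5(filename) + timestamp: the timestamp stays visible in the name."""
    digest = hashlib.md5(source_url.encode("utf-8")).hexdigest()
    return f"{digest}-{timestamp}"


def name_hash_of_both(source_url, timestamp):
    """md5(filename + timestamp): everything is folded into one digest."""
    combined = f"{source_url}{timestamp}"
    return hashlib.md5(combined.encode("utf-8")).hexdigest()


url = "http://example.com/image.jpg"  # placeholder URL
print(name_hash_then_timestamp(url, 1700000000))
print(name_hash_of_both(url, 1700000000))
```

In the first form, two names can only be equal if both the digest and the timestamp match. In the second, the name is a single 128-bit value, so the birthday bound discussed in the other answers applies directly.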

Solution 4 - Random

A rough rule of thumb for collisions is the square root of the range of values. Your MD5 sig is presumably 128 bits long, so collisions only become likely once you have on the order of 2^64 images.

Solution 5 - Random

Although random MD5 collisions are exceedingly rare, if your users can provide files (that will be stored verbatim) then they can engineer collisions to occur. That is, they can deliberately create two files with the same MD5sum but different data. Make sure your application can handle this case in a sensible way, or perhaps use a stronger hash like SHA-256.
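A minimal sketch of the SHA-256 option, assuming the name is derived from the file's bytes. The helper name and the default ".jpg" extension are illustrative.

```python
import hashlib


def sha256_name(data: bytes, extension=".jpg"):
    """Derive a filename from the SHA-256 digest of the file contents."""
    return hashlib.sha256(data).hexdigest() + extension


print(sha256_name(b"example image bytes"))
```

The digest is 64 hex characters rather than MD5's 32, and no practical technique for engineering SHA-256 collisions is publicly known.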

Solution 6 - Random

While there have been well-publicized problems with MD5 due to collisions, UNINTENTIONAL collisions among random data are exceedingly rare. On the other hand, if you are hashing on the file name, that's not random data, and I would expect collisions quickly.

Solution 7 - Random

Doesn't really matter how likely it is; it is possible. It could happen on the first two things you hash (very unlikely, but possible), so you'll need to support collisions from the beginning.
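One way to support collisions from the start is to check whether a name is already taken before writing. The sketch below uses a plain dict as a stand-in for the storage backend; the function name and the numeric-suffix scheme are assumptions, not a prescribed approach.

```python
import hashlib


def store_image(store, source_url, timestamp, data):
    """Store data under an MD5-derived key, choosing a new key if the
    first choice is already taken by different content."""
    base = hashlib.md5(f"{source_url}{timestamp}".encode("utf-8")).hexdigest()
    key = base
    attempt = 0
    # Keep trying suffixed keys while the key holds different data.
    while key in store and store[key] != data:
        attempt += 1
        key = f"{base}-{attempt}"
    store[key] = data
    return key


store = {}
first = store_image(store, "http://example.com/a.jpg", 1700000000, b"first")
second = store_image(store, "http://example.com/a.jpg", 1700000000, b"second")
print(first, second)
```

Here the second call hashes to the same base name but carries different bytes, so it ends up under a suffixed key instead of overwriting the first image.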

Solution 8 - Random

MD5 collision is extremely unlikely. If you have 9 trillion MD5s, there is only one chance in 9 trillion that there will be a collision.
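That figure can be sanity-checked with the usual birthday approximation, p ≈ n(n-1)/2 divided by 2^128, which holds while p is much smaller than 1. The sketch assumes uniformly random 128-bit hashes; the names are illustrative.

```python
from fractions import Fraction


def collision_probability(n, bits=128):
    """Approximate probability of at least one collision among n
    uniformly random hashes of the given bit length."""
    pairs = Fraction(n * (n - 1), 2)  # number of distinct pairs
    return pairs / Fraction(2 ** bits)


p = collision_probability(9 * 10**12)  # 9 trillion hashes
one_in = float(1 / p)
print(f"about 1 in {one_in:.2e}")
```

Under this approximation the result is roughly one chance in 8.4 trillion, the same order of magnitude as the figure quoted above.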

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Ben Throop | View Question on Stackoverflow
Solution 1 - Random | Kornel | View Answer on Stackoverflow
Solution 2 - Random | davr | View Answer on Stackoverflow
Solution 3 - Random | Ryan | View Answer on Stackoverflow
Solution 4 - Random | Will Dean | View Answer on Stackoverflow
Solution 5 - Random | bdonlan | View Answer on Stackoverflow
Solution 6 - Random | acrosman | View Answer on Stackoverflow
Solution 7 - Random | Karg | View Answer on Stackoverflow
Solution 8 - Random | Rick James | View Answer on Stackoverflow