What's the shortest pair of strings that causes an MD5 collision?

Math Problem Overview

Up to what string length is it possible to use MD5 as a hash without having to worry about the possibility of a collision?

This would presumably be calculated by generating an MD5 hash for every possible string in a particular character set, in increasing length, until a hash appears for a second time (a collision). The maximum possible length of a string without a collision would then be one character less than the longest of the colliding pair.

Has this already been tested for MD5, SHA1, etc?

Math Solutions

Solution 1 - Math

Update

Ironically, a few weeks after I posted the previous answer, two Chinese researchers, Tao Xie and Dengguo Feng, published a new single-block collision for MD5. I was unaware of that paper until now. A single MD5 block means that the input size is 64 bytes or 512 bits. Note that the inputs are mostly the same, differing only in 2 bits.

Their methodology won't be published until January 2013, but their collision can be verified now, using numbers from the paper:

>>> from array import array
>>> from hashlib import md5
>>> input1 = array('I',  [0x6165300e,0x87a79a55,0xf7c60bd0,0x34febd0b,0x6503cf04,
    0x854f709e,0xfb0fc034,0x874c9c65,0x2f94cc40,0x15a12deb,0x5c15f4a3,0x490786bb,
    0x6d658673,0xa4341f7d,0x8fd75920,0xefd18d5a])
>>> input2 = array('I', [x^y for x,y in zip(input1,
    [0, 0, 0, 0, 0, 1<<10, 0, 0, 0, 0, 1<<31, 0, 0, 0, 0, 0])])
>>> input1 == input2
False
>>> md5(input1).hexdigest()
'cee9a457e790cf20d4bdaa6d69f01e41'
>>> md5(input2).hexdigest()
'cee9a457e790cf20d4bdaa6d69f01e41'

Update: The paper has been published in March 2013: Tao Xie and Fanbao Liu and Dengguo Feng - Fast Collision Attack on MD5

However, if you have more room to play with, collisions of a few kilobytes are MUCH faster to calculate -- they can be calculated within hours on ANY regular computer.

Old answer

The previous shortest collision used at least two MD5 blocks worth of input -- that's 128 bytes, 1024 bits. A prefix in the first block can be chosen arbitrarily by the attacker, the rest would be computed and appear as gibberish.

Here's an example of two different colliding inputs, you can try it yourself in Python:

>>> from binascii import unhexlify
>>> from hashlib import md5
>>> input1 = 'Oded Goldreich\nOded Goldreich\nOded Goldreich\nOded Go' + unhexlify(
... 'd8050d0019bb9318924caa96dce35cb835b349e144e98c50c22cf461244a4064bf1afaecc582'
... '0d428ad38d6bec89a5ad51e29063dd79b16cf67c12978647f5af123de3acf844085cd025b956')
>>> len(input1)
128
>>> md5(input1).hexdigest()
'd320b6433d8ebc1ac65711705721c2e1'
>>> input2 = 'Neal Koblitz\nNeal Koblitz\nNeal Koblitz\nNeal Koblitz\n' + unhexlify(
... '75b80e0035f3d2c909af1baddce35cb835b349e144e88c50c22cf461244a40e4bf1afaecc582'
... '0d428ad38d6bec89a5ad51e29063dd79b16cf6fc11978647f5af123de3acf84408dcd025b956')
>>> md5(input2).hexdigest()
'd320b6433d8ebc1ac65711705721c2e1'

Generating these two particular inputs took 2 days on a 215-node Playstation 3 cluster, by Mark Stevens :)

Solution 2 - Math

The mathematics of the birthday paradox make the inflection point of probability of collision roughly around sqrt(N), where N is the number of distinct bins in the hash function, so for a 128-bit hash, as you get around 64 bits you are moderately likely to have 1 collision. So my guess is for the complete set of 8 byte strings it's somewhat likely to have a collision, and for 9 byte strings it's extremely likely.

edit: this assumes that the MD5 hash algorithm causes a mapping from input bytestring to output hash that is close to "random". (vs. one that distributes strings more evenly among the set of possible hashes, in which case it would be more close to 16 bytes.)

Also for a more specific numerical answer, if you look at one of the approximations for calculating collision probability, you get

p(k) ≈ 1 - e^{-k(k-1)/(2*2¹²⁸)} where k = the size of the space of possible inputs = 2^m where the input bytestring is m bits long.

the set of 8 byte strings: p(2⁶⁴) ≈ 1 - e^-0.5 ≈ 0.3935

the set of 9 byte strings: p(2⁷²) ≈ 1 - e^{-2¹⁴⁴/(2*2¹²⁸)} = 1 - e^-2¹⁵ = 1 - e^-32768 ≈ 1

Also note that these assume the complete set of m/8 byte strings. If you only use alphanumeric characters, you'd need more bytes to get a probable collision.

Solution 3 - Math

I doubt if there is any useful length where you're not going to have possible collisions. Those algorithms are not really used for that purpose. It's meant to try to be unique for slight changes in the data (like corrupted files) rather than unique over all possible sets of data.

Content Type	Original Author	Original Content on Stackoverflow
Question	Alf Eaton	View Question on Stackoverflow
Solution 1 - Math	intgr	View Answer on Stackoverflow
Solution 2 - Math	Jason S	View Answer on Stackoverflow
Solution 3 - Math	aDev	View Answer on Stackoverflow