Is MD5 still good enough to uniquely identify files?


Hash Problem Overview


Is MD5 hashing a file still considered a good enough method to uniquely identify it, given all the breaks of the MD5 algorithm and its security issues? Security is not my primary concern here, but uniquely identifying each file is.

Any thoughts?

Hash Solutions


Solution 1 - Hash

Yes. MD5 has been completely broken from a security perspective, but the probability of an accidental collision is still vanishingly small. Just be sure that the files aren't being created by someone you don't trust and who might have malicious intent.

Solution 2 - Hash

For practical purposes, the hash created might be suitably random, but theoretically there is always a probability of a collision, due to the pigeonhole principle. Having different hashes certainly means that the files are different, but getting the same hash doesn't necessarily mean that the files are identical.

Using a hash function for that purpose, whether security is a concern or not, should therefore only ever be the first step of a check, especially if the hash algorithm is known to easily produce collisions. To reliably find out whether two files with the same hash are different, you would have to compare those files byte by byte.
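The two-step check described above can be sketched in Python: the hash serves as a cheap first filter, and only a hash match triggers the byte-by-byte comparison (the function name and chunk size are illustrative choices, not from the original answer):

```python
import hashlib
import filecmp

def files_identical(path_a: str, path_b: str) -> bool:
    """Fast hash check first; confirm byte-by-byte only on a hash match."""
    def md5_of(path: str) -> str:
        h = hashlib.md5()
        with open(path, "rb") as f:
            # Read in chunks so large files don't need to fit in memory.
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        return h.hexdigest()

    if md5_of(path_a) != md5_of(path_b):
        return False  # Different hashes: the files are definitely different.
    # Same hash: could in principle be a collision, so compare contents.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

This way the expensive full comparison runs only in the rare case where the cheap check cannot distinguish the files.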

Solution 3 - Hash

MD5 will be good enough if you have no adversary. However, someone can (purposely) create two distinct files which hash to the same value (that's called a collision), and this may or may not be a problem, depending on your exact situation.

Since knowing whether the known MD5 weaknesses apply in a given context is a subtle matter, it is recommended not to use MD5. Using a collision-resistant hash function (SHA-256 or SHA-512) is the safe answer. Also, using MD5 is bad public relations: if you use MD5, be prepared to justify yourself, whereas nobody will question your use of SHA-256.
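Switching to SHA-256 is a one-line change in most languages. A minimal sketch with Python's standard `hashlib` module (the function name is illustrative):

```python
import hashlib

def sha256_of_file(path: str) -> str:
    """Return the SHA-256 hex digest of a file, streamed in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()
```

The API is identical to `hashlib.md5`, so existing hashing code usually needs nothing more than swapping the constructor.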

Solution 4 - Hash

MD5 can produce collisions. Theoretically, although it is highly unlikely, a million files in a row could produce the same hash. Don't test your luck: check for MD5 collisions before storing the value.

I personally like to create an MD5 of random strings instead, which reduces the overhead of hashing large files. When a collision is found, I iterate and re-hash with a loop counter appended.

You may want to read up on the pigeonhole principle.
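The iterate-and-re-hash idea above can be sketched as follows. This is an interpretation of the answer's scheme, assuming the set of already-stored keys (`seen` here) is available; the function name and counter encoding are illustrative:

```python
import hashlib

def unique_md5(data: bytes, seen: set) -> str:
    """Return an MD5-based key not already in `seen`.

    On a collision, append an incrementing loop counter to the data
    and re-hash until the key is unique, then record it.
    """
    key = hashlib.md5(data).hexdigest()
    counter = 0
    while key in seen:
        counter += 1
        key = hashlib.md5(data + str(counter).encode()).hexdigest()
    seen.add(key)
    return key
```

Note that under this scheme identical inputs inserted twice receive different keys, which is consistent with checking for collisions before storing each value.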

Solution 5 - Hash

I wouldn't recommend it. If the application will run on a multi-user system, there might be a user who has two files with the same MD5 hash (he might be an engineer who plays with such files, or just curious: colliding pairs are easily downloadable from http://www2.mat.dtu.dk/people/S.Thomsen/wangmd5/samples.html, and I myself downloaded two samples while writing this answer). Another thing is that some applications might store such duplicates for whatever reason (I'm not sure whether any such applications exist, but the possibility does).

If you are uniquely identifying files generated by your program, I would say it is OK to use MD5. Otherwise, I would recommend any other hash function for which no collisions are known yet.

Solution 6 - Hash

Personally, I think people use raw checksums (pick your method) of objects as unique identifiers far too often, when what they really want is a unique identifier. Fingerprinting an object wasn't the intent of these hashes, and using them this way is likely to require more thought than using a UUID or a similar mechanism.
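For a pure identifier that is independent of file contents, a random UUID is the standard tool. A minimal sketch in Python (the helper name is illustrative):

```python
import uuid

def new_file_id() -> str:
    """Mint a version 4 UUID: 122 random bits, designed to be a unique
    identifier in its own right, independent of any file's contents."""
    return str(uuid.uuid4())
```

The trade-off is that a UUID identifies a *record*, not the content: two byte-identical files get different UUIDs, whereas a content hash would give them the same key. Which behaviour you want depends on why you need the identifier.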

Solution 7 - Hash

MD5 has been broken; you could use SHA-1 instead (it is implemented in most languages).

Solution 8 - Hash

When hashing short strings or files (under a few KB?), one can create two MD5 hash keys: one for the actual string, and a second for the reverse of the string concatenated with a short asymmetric string. Example: md5(reverse(string || '1010')). Adding the extra string ensures that even files consisting of a series of identical bits generate two different keys. Please understand that even under this scheme there is a theoretical chance of the two hash keys being identical for non-identical strings, but the probability seems exceedingly small, on the order of the square of the single MD5 collision probability, and the time saving can be considerable when the number of files grows. More elaborate schemes for creating the second string could be considered as well, but I am not sure these would substantially improve the odds.
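The double-key scheme described above can be sketched in Python. This follows the answer's recipe literally, md5(reverse(string || '1010')), with '1010' as the asymmetric suffix; the function name is illustrative:

```python
import hashlib

def double_key(data: bytes) -> tuple:
    """Produce two MD5 keys for the same input.

    Key 1 is the MD5 of the data itself. Key 2 appends the short
    asymmetric string '1010' and reverses the result before hashing,
    i.e. md5(reverse(data || '1010')), so even inputs made of one
    repeated bit pattern yield two distinct keys.
    """
    k1 = hashlib.md5(data).hexdigest()
    k2 = hashlib.md5((data + b"1010")[::-1]).hexdigest()
    return (k1, k2)
```

Two files are then treated as identical only if both keys match, which squares the (already tiny) accidental collision probability.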

To check for collisions, one can run this test for the uniqueness of the MD5 hash keys over all bit_vectors in a database table:

select md5(bit_vector), count(distinct bit_vector)
from db
group by md5(bit_vector)
having count(distinct bit_vector) > 1

Solution 9 - Hash

I like to think of MD5 as an indicator of probability when storing a large amount of file data.

If the hashes are equal, I know I have to compare the files byte by byte, but that should happen only rarely, and only as a false positive; otherwise (the hashes are not equal) I can be certain we're talking about two different files.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type: Original Author (Original Content on Stackoverflow)

Question: Ranhiru Jude Cooray (View Question on Stackoverflow)
Solution 1 - Hash: Marcelo Cantos (View Answer on Stackoverflow)
Solution 2 - Hash: stapeluberlauf (View Answer on Stackoverflow)
Solution 3 - Hash: Thomas Pornin (View Answer on Stackoverflow)
Solution 4 - Hash: afilina (View Answer on Stackoverflow)
Solution 5 - Hash: tach (View Answer on Stackoverflow)
Solution 6 - Hash: hpavc (View Answer on Stackoverflow)
Solution 7 - Hash: Guillaume Lebourgeois (View Answer on Stackoverflow)
Solution 8 - Hash: marcopolo (View Answer on Stackoverflow)
Solution 9 - Hash: Shimmy Weitzhandler (View Answer on Stackoverflow)