What is the best 32bit hash function for short strings (tag names)?

AlgorithmHashTags32 Bit

Algorithm Problem Overview


What is the best 32bit hash function for relatively short strings?

Strings are tag names that consist of English letters, numbers, spaces and some additional characters (#, $, ., ...). For example: Unit testing, C# 2.0.

I am looking for 'best' as in 'minimal collisions', performance is not important for my goals.

Algorithm Solutions


Solution 1 - Algorithm

I'm not sure if it's the best choice, but here is a hash function for strings:

The Practice of Programming (HASH TABLES, pg. 57)

/* hash: compute hash value of string */
unsigned int hash(char *str)
{
   unsigned int h;
   unsigned char *p;

   h = 0;
   for (p = (unsigned char*)str; *p != '\0'; p++)
      h = MULTIPLIER * h + *p;
   return h; // or, h % ARRAY_SIZE;
}

> Empirically, the values 31 and 37 have proven to be good choices for the multiplier in a hash function for ASCII strings.

Solution 2 - Algorithm

If performance isn't important, simply take a secure hash such as MD5 or SHA1, and truncate its output to 32 bits. This will give you a distribution of hash codes that's indistinguishable from random.

Solution 3 - Algorithm

I'm sorry for the very late reply on this. Earlier this year I composed a page titled Hashing Short Strings which might be helpful in this discussion. In summary, I found that CRC-32 and FNV-1a are superior for hashing short strings. They are efficient and produced widely distributed and collision free hashes in my tests. I was surprised to find that MD5, SHA-1 and SHA-3 produced small numbers of collisions when the output was folded down to 32-bits.

Solution 4 - Algorithm

That depends on your hardware. On modern hardware, i.e. Intel/AMD with SSE4.2 or arm7 you should use the internal _mm_crc32_uxx intrinsics, as they are optimal for short strings. (For long keys also, but then better use Adler's threaded version, as in zlib)

On old or unknown hardware, either run-time probe for the SSE4.2 or CRC32 feature or just use one if the simple good hash functions. E.g. Murmur2 or City

An overview of quality and performance is here: https://github.com/rurban/smhasher#smhasher

There are also all the implementations. Favored are https://github.com/rurban/smhasher/blob/master/crc32_hw.c and https://github.com/rurban/smhasher/blob/master/MurmurHash2.cpp

If you know the keys in advance, use a perfect hash, not a hash function. E.g. gperf or my phash: https://github.com/rurban/Perfect-Hash#name

Nowadays perfect hash generation via a c compiler is so fast, you can even create them on the fly, and dynaload it.

Solution 5 - Algorithm

You might check out murmurhash2. It is fast, also for small strings, and has a good mixing final step so it is even good mixed for very small strings.

Solution 6 - Algorithm

Use MaPrime2c hash function:

static const unsigned char sTable[256] =
{
  0xa3,0xd7,0x09,0x83,0xf8,0x48,0xf6,0xf4,0xb3,0x21,0x15,0x78,0x99,0xb1,0xaf,0xf9,
  0xe7,0x2d,0x4d,0x8a,0xce,0x4c,0xca,0x2e,0x52,0x95,0xd9,0x1e,0x4e,0x38,0x44,0x28,
  0x0a,0xdf,0x02,0xa0,0x17,0xf1,0x60,0x68,0x12,0xb7,0x7a,0xc3,0xe9,0xfa,0x3d,0x53,
  0x96,0x84,0x6b,0xba,0xf2,0x63,0x9a,0x19,0x7c,0xae,0xe5,0xf5,0xf7,0x16,0x6a,0xa2,
  0x39,0xb6,0x7b,0x0f,0xc1,0x93,0x81,0x1b,0xee,0xb4,0x1a,0xea,0xd0,0x91,0x2f,0xb8,
  0x55,0xb9,0xda,0x85,0x3f,0x41,0xbf,0xe0,0x5a,0x58,0x80,0x5f,0x66,0x0b,0xd8,0x90,
  0x35,0xd5,0xc0,0xa7,0x33,0x06,0x65,0x69,0x45,0x00,0x94,0x56,0x6d,0x98,0x9b,0x76,
  0x97,0xfc,0xb2,0xc2,0xb0,0xfe,0xdb,0x20,0xe1,0xeb,0xd6,0xe4,0xdd,0x47,0x4a,0x1d,
  0x42,0xed,0x9e,0x6e,0x49,0x3c,0xcd,0x43,0x27,0xd2,0x07,0xd4,0xde,0xc7,0x67,0x18,
  0x89,0xcb,0x30,0x1f,0x8d,0xc6,0x8f,0xaa,0xc8,0x74,0xdc,0xc9,0x5d,0x5c,0x31,0xa4,
  0x70,0x88,0x61,0x2c,0x9f,0x0d,0x2b,0x87,0x50,0x82,0x54,0x64,0x26,0x7d,0x03,0x40,
  0x34,0x4b,0x1c,0x73,0xd1,0xc4,0xfd,0x3b,0xcc,0xfb,0x7f,0xab,0xe6,0x3e,0x5b,0xa5,
  0xad,0x04,0x23,0x9c,0x14,0x51,0x22,0xf0,0x29,0x79,0x71,0x7e,0xff,0x8c,0x0e,0xe2,
  0x0c,0xef,0xbc,0x72,0x75,0x6f,0x37,0xa1,0xec,0xd3,0x8e,0x62,0x8b,0x86,0x10,0xe8,
  0x08,0x77,0x11,0xbe,0x92,0x4f,0x24,0xc5,0x32,0x36,0x9d,0xcf,0xf3,0xa6,0xbb,0xac,
  0x5e,0x6c,0xa9,0x13,0x57,0x25,0xb5,0xe3,0xbd,0xa8,0x3a,0x01,0x05,0x59,0x2a,0x46
};


#define PRIME_MULT 1717


unsigned int
maPrime2cHash (unsigned char *str, unsigned int len)
{
  unsigned int hash = len, i;


  for (i = 0; i != len; i++, str++)
    {

      hash ^= sTable[( *str + i) & 255];
      hash = hash * PRIME_MULT;
    }

  return hash;
}

and look at www.amsoftware.narod.ru/algo2.html for MaFastPrime, MaRushPrime, etc tests.

Solution 7 - Algorithm

If it's rare that users add new tags, then you can use a perfect hash (http://en.wikipedia.org/wiki/Perfect_hash_function) that's recomputed each time a new tag is added. Of course, without knowing the problem you are really trying to solve, it's guesswork to figure out what you might do.

Solution 8 - Algorithm

If your program needs communicate with other system, it is better to use a algorithm which is well known. The quick & dirty way is using first Several characters of md5 hash. You don't need spend hours or days to invent wheels in your project.

The disadvantage is get much much high chance to collisions. However, if your hash is for a time-stamped session, or short life circule task. There is no problem to use that.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionAndrey ShchekinView Question on Stackoverflow
Solution 1 - AlgorithmNick DandoulakisView Answer on Stackoverflow
Solution 2 - AlgorithmNick JohnsonView Answer on Stackoverflow
Solution 3 - AlgorithmgfkeoghView Answer on Stackoverflow
Solution 4 - AlgorithmrurbanView Answer on Stackoverflow
Solution 5 - AlgorithmRitsaert HornstraView Answer on Stackoverflow
Solution 6 - AlgorithmAlexanderView Answer on Stackoverflow
Solution 7 - Algorithmuser97370View Answer on Stackoverflow
Solution 8 - AlgorithmYi JiangView Answer on Stackoverflow