Good Hash Function for Strings

JavaHashHashtableHashcode

Java Problem Overview


I'm trying to think up a good hash function for strings. And I was thinking it might be a good idea to sum up the unicode values for the first five characters in the string (assuming it has five, otherwise stop where it ends). Would that be a good idea, or is it a bad one?

I am doing this in Java, but I wouldn't imagine that would make much of a difference.

Java Solutions


Solution 1 - Java

Usually hashes wouldn't do sums, otherwise stop and pots will have the same hash.

and you wouldn't limit it to the first n characters because otherwise house and houses would have the same hash.

Generally hashs take values and multiply it by a prime number (makes it more likely to generate unique hashes) So you could do something like:

int hash = 7;
for (int i = 0; i < strlen; i++) {
    hash = hash*31 + charAt(i);
}

Solution 2 - Java

If it's a security thing, you could use Java crypto:

import java.security.MessageDigest;

MessageDigest messageDigest = MessageDigest.getInstance("SHA-256");
messageDigest.update(stringToHash.getBytes());
String stringHash = new String(messageDigest.digest());

Solution 3 - Java

You should probably use String.hashCode().

If you really want to implement hashCode yourself:

> Do not be tempted to exclude > significant parts of an object from > the hash code computation to improve > performance -- Joshua Bloch, Effective Java

Using only the first five characters is a bad idea. Think about hierarchical names, such as URLs: they will all have the same hash code (because they all start with "http://", which means that they are stored under the same bucket in a hash map, exhibiting terrible performance.

Here's a war story paraphrased on the String hashCode from "Effective Java":

> The String hash function implemented > in all releases prior to 1.2 examined > at most sixteen characters, evenly > spaced throughout the string, starting > with the first character. For large > collections of hierarchical names, > such as URLs, this hash function > displayed terrible behavior.

Solution 4 - Java

If you are doing this in Java then why are you doing it? Just call .hashCode() on the string

Solution 5 - Java

Guava's HashFunction (javadoc) provides decent non-crypto-strong hashing.

Solution 6 - Java

This function provided by Nick is good but if you use new String(byte[] bytes) to make the transformation to String, it failed. You can use this function to do that.

private static final char[] hex = { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f' };

public static String byteArray2Hex(byte[] bytes) {
	StringBuffer sb = new StringBuffer(bytes.length * 2);
	for(final byte b : bytes) {
		sb.append(hex[(b & 0xF0) >> 4]);
		sb.append(hex[b & 0x0F]);
	}
	return sb.toString();
}

public static String getStringFromSHA256(String stringToEncrypt) throws NoSuchAlgorithmException {
	MessageDigest messageDigest = MessageDigest.getInstance("SHA-256");
	messageDigest.update(stringToEncrypt.getBytes());
	return byteArray2Hex(messageDigest.digest());
}

May be this can help somebody

Solution 7 - Java

FNV-1 is rumoured to be a good hash function for strings.

For long strings (longer than, say, about 200 characters), you can get good performance out of the MD4 hash function. As a cryptographic function, it was broken about 15 years ago, but for non cryptographic purposes, it is still very good, and surprisingly fast. In the context of Java, you would have to convert the 16-bit char values into 32-bit words, e.g. by grouping such values into pairs. A fast implementation of MD4 in Java can be found in sphlib. Probably overkill in the context of a classroom assignment, but otherwise worth a try.

Solution 8 - Java

// djb2 hash function
unsigned long hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while (c = *str++)
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}

source Logic behind djb2 hash function - SO

Solution 9 - Java

If you want to see the industry standard implementations, I'd look at java.security.MessageDigest.

"Message digests are secure one-way hash functions that take arbitrary-sized data and output a fixed-length hash value."

Solution 10 - Java

sdbm:this algorithm was created for sdbm (a public-domain reimplementation of ndbm) database library

static unsigned long sdbm(unsigned char *str)
{   
    unsigned long hash = 0;
    int c;
    while (c = *str++)
            hash = c + (hash << 6) + (hash << 16) - hash;

    return hash;
}

Solution 11 - Java

         public String hashString(String s) throws NoSuchAlgorithmException {
	byte[] hash = null;
	try {
		MessageDigest md = MessageDigest.getInstance("SHA-256");
		hash = md.digest(s.getBytes());
                    
	} catch (NoSuchAlgorithmException e) { e.printStackTrace(); }
	StringBuilder sb = new StringBuilder();
	for (int i = 0; i < hash.length; ++i) {
		String hex = Integer.toHexString(hash[i]);
		if (hex.length() == 1) {
			sb.append(0);
			sb.append(hex.charAt(hex.length() - 1));
		} else {
			sb.append(hex.substring(hex.length() - 2));
		}
	}
	return sb.toString();
}
   

Solution 12 - Java

This will avoid any collision and it will be fast until we use the shifting in calculations.

 int k = key.length();
    int sum = 0;
    for(int i = 0 ; i < k-1 ; i++){
        sum += key.charAt(i)<<(5*i);
    }

Solution 13 - Java

Its a good idea to work with odd number when trying to develop a good hast function for string. this function takes a string and return a index value, so far its work pretty good. and has less collision. the index ranges from 0 - 300 maybe even more than that, but i haven't gotten any higher so far even with long words like "electromechanical engineering"

int keyHash(string key)
{
	unsigned int k = (int)key.length();
 	unsigned int u = 0,n = 0;

	for (Uint i=0; i<k; i++)
	{
 		n = (int)key[i];
		u += 7*n%31;
	}
	return u%139;
}

another thing you can do is multiplying each character int parse by the index as it increase like the word "bear" (0b) + (1e) + (2a) + (3r) which will give you an int value to play with. the first hash function above collide at "here" and "hear" but still great at give some good unique values. the one below doesn't collide with "here" and "hear" because i multiply each character with the index as it increases.

int keyHash(string key)
{
	unsigned int k = (int)key.length();
 	unsigned int u = 0,n = 0;

	for (Uint i=0; i<k; i++)
	{
 		n = (int)key[i];
		u += i*n%31;
	}
	return u%139;
}

Solution 14 - Java

Here's a simple hash function that I use for a hash table I built. Its basically for taking a text file and stores every word in an index which represents the alphabetical order.

int generatehashkey(const char *name)
{
        int x = tolower(name[0])- 97;
        if (x < 0 || x > 25)
           x = 26;
        return x;
}

What this basically does is words are hashed according to their first letter. So, word starting with 'a' would get a hash key of 0, 'b' would get 1 and so on and 'z' would be 25. Numbers and symbols would have a hash key of 26. THere is an advantage this provides; You can calculate easily and quickly where a given word would be indexed in the hash table since its all in an alphabetical order, something like this: Code can be found here: https://github.com/abhijitcpatil/general

> Giving the following text as input: Atticus said to Jem one day, “I’d rather you shot at tin cans in the backyard, but I know you’ll go > after birds. Shoot all the blue jays you want, if you can hit ‘em, but > remember it’s a sin to kill a mockingbird.” That was the only time I > ever heard Atticus say it was a sin to do something, and I asked Miss > Maudie about it. “Your father’s right,” she said. “Mockingbirds don’t > do one thing except make music for us to enjoy. They don’t eat up > people’s gardens, don’t nest in corn cribs, they don’t do one thing > but sing their hearts out for us. That’s why it’s a sin to kill a > mockingbird.

This would be the output:

0 --> a a about asked and a Atticus a a all after at Atticus
1 --> but but blue birds. but backyard
2 --> cribs corn can cans
3 --> do don’t don’t don’t do don’t do day
4 --> eat enjoy. except ever
5 --> for for father’s
6 --> gardens go
7 --> hearts heard hit
8 --> it’s in it. I it I it’s if I in
9 --> jays Jem
10 --> kill kill know
11 --> 
12 --> mockingbird. music make Maudie Miss mockingbird.”
13 --> nest
14 --> out one one only one
15 --> people’s
16 --> 17 --> right remember rather
18 --> sin sing said. she something sin say sin Shoot shot said
19 --> to That’s their thing they They to thing to time the That to the the tin to
20 --> us. up us
21 --> 
22 --> why was was want
23 --> 
24 --> you you you’ll you
25 --> 
26 --> “Mockingbirds ” “Your ‘em “I’d

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionLeif AndersenView Question on Stackoverflow
Solution 1 - JavajonathanasdfView Answer on Stackoverflow
Solution 2 - JavaNickView Answer on Stackoverflow
Solution 3 - JavaFrederikView Answer on Stackoverflow
Solution 4 - JavaPyrolisticalView Answer on Stackoverflow
Solution 5 - JavaMike SamuelView Answer on Stackoverflow
Solution 6 - JavaFestus TamakloeView Answer on Stackoverflow
Solution 7 - JavaThomas PorninView Answer on Stackoverflow
Solution 8 - JavaPratik DeoghareView Answer on Stackoverflow
Solution 9 - JavaDean JView Answer on Stackoverflow
Solution 10 - JavaAnchalView Answer on Stackoverflow
Solution 11 - JavaCharaf JRAView Answer on Stackoverflow
Solution 12 - Javakamal el-deen shairView Answer on Stackoverflow
Solution 13 - JavakanthonyeView Answer on Stackoverflow
Solution 14 - Javauser2311285View Answer on Stackoverflow