String similarity score/hash

Tags: algorithm, hash, similarity

Algorithm Problem Overview


Is there a method to calculate something like a general "similarity score" for a string? In such a way that I am not comparing two strings together, but rather I get some number (hash) for each string that can later tell me whether two strings are or are not similar. Two similar strings should have similar (close) hashes.

Let's consider these strings and scores as an example:

Hello world                1000
Hello world!               1010
Hello earth                1125
Foo bar                    3250
FooBarbar                  3750
Foo Bar!                   3300
Foo world!                 2350

You can see that Hello world! and Hello world are similar and their scores are close to each other.

This way, finding the strings most similar to a given string would be done by subtracting the given string's score from the other scores and then sorting by the absolute value of the difference.
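For illustration, the intended lookup could be sketched like this (Python used here purely as pseudocode, with the hypothetical scores from the table above):

```python
# The hypothetical scores from the example above.
scores = {
    "Hello world": 1000,
    "Hello world!": 1010,
    "Hello earth": 1125,
    "Foo bar": 3250,
    "FooBarbar": 3750,
    "Foo Bar!": 3300,
    "Foo world!": 2350,
}

def most_similar(query, scores, n=3):
    """Rank strings by the absolute difference of their scores to the query's."""
    target = scores[query]
    return sorted(scores, key=lambda s: abs(scores[s] - target))[:n]

# The first result is the query itself (difference 0), then the "closest" scores.
print(most_similar("Hello world", scores))
```
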

Algorithm Solutions


Solution 1 - Algorithm

I believe what you're looking for is called a Locality Sensitive Hash. Whereas most hash algorithms are designed such that small variations in input cause large changes in output, these hashes attempt the opposite: small changes in input generate proportionally small changes in output.

As others have mentioned, there are inherent issues with forcing a multi-dimensional space into a lower-dimensional mapping. It's analogous to creating a flat map of the Earth: you can never accurately represent a sphere on a flat surface. The best you can do is find an LSH that is optimized for whatever feature you're using to determine whether strings are "alike".
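One concrete LSH family for text is SimHash. A minimal sketch, hashing character trigrams (the trigram size and the 64-bit width are arbitrary choices for this example, not anything the answer prescribes):

```python
import hashlib

def simhash(text, n=3, bits=64):
    """SimHash over character n-grams: similar strings tend to get
    hashes that differ in few bit positions (small Hamming distance)."""
    v = [0] * bits
    for i in range(max(len(text) - n + 1, 1)):
        gram = text[i:i + n]
        h = int.from_bytes(hashlib.md5(gram.encode()).digest()[:8], "big")
        for b in range(bits):
            v[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if v[b] > 0)

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

# Similar strings should usually end up closer in Hamming distance:
print(hamming(simhash("Hello world"), simhash("Hello world!")))
print(hamming(simhash("Hello world"), simhash("Foo bar")))
```

Note that the "score" here is compared by Hamming distance on bits, not by subtracting two integers, which is exactly the compromise the answer describes.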

Solution 2 - Algorithm

Levenshtein distance, or one of its derivatives, is the algorithm you want. Match the given string against each string in the dictionary. (Here, if you need only a fixed number of the most similar strings, you may want to use a min-heap.) If running Levenshtein distance against all strings in the dictionary is too expensive, then first use some rough algorithm that excludes words that are too distant from the list of candidates. After that, run Levenshtein distance on the remaining candidates.


One way to remove distant words is to index n-grams. Preprocess the dictionary by splitting each word into a list of n-grams. For example, consider n=3:

(0) "Hello world" -> ["Hel", "ell", "llo", "lo ", "o w", " wo", "wor", "orl", "rld"]
(1) "FooBarbar" -> ["Foo", "ooB", "oBa", "Bar", "arb", "rba", "bar"]
(2) "Foo world!" -> ["Foo", "oo ", "o w", " wo", "wor", "orl", "rld", "ld!"]

Next, create an index of n-grams:

" wo" -> [0, 2]
"Bar" -> [1]
"Foo" -> [1, 2]
"Hel" -> [0]
"arb" -> [1]
"bar" -> [1]
"ell" -> [0]
"ld!" -> [2]
"llo" -> [0]
"lo " -> [0]
"o w" -> [0, 2]
"oBa" -> [1]
"oo " -> [2]
"ooB" -> [1]
"orl" -> [0, 2]
"rba" -> [1]
"rld" -> [0, 2]
"wor" -> [0, 2]

When you need to find the most similar strings for a given string, split it into n-grams and select from the dictionary only those words which share at least one n-gram with it. This reduces the number of candidates to a reasonable amount, and you may proceed with Levenshtein-matching the given string against each of the remaining candidates.
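The steps above can be sketched as follows (a toy illustration; names like `build_index` are made up for this example):

```python
from collections import defaultdict

def ngrams(s, n=3):
    """Set of character n-grams of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def build_index(words, n=3):
    """Inverted index: n-gram -> set of word ids containing it."""
    index = defaultdict(set)
    for wid, word in enumerate(words):
        for gram in ngrams(word, n):
            index[gram].add(wid)
    return index

def levenshtein(a, b):
    """Plain edit distance (unit costs), row-by-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def most_similar(query, words, index, n=3):
    """Collect candidates sharing at least one n-gram, then rank by edit distance."""
    candidates = set()
    for gram in ngrams(query, n):
        candidates |= index.get(gram, set())
    return sorted(candidates, key=lambda wid: levenshtein(query, words[wid]))

words = ["Hello world", "FooBarbar", "Foo world!"]
index = build_index(words)
# "FooBarbar" shares no trigram with the query, so it is never even scored.
print([words[i] for i in most_similar("Hello world!", words, index)])
```
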


If your strings are long enough, you may reduce the index size by using the min-hashing technique: calculate an ordinary hash for each n-gram and keep only the K smallest hashes, throwing the others away.
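A minimal sketch of that min-hashing idea (K = 4 and MD5 are arbitrary choices for this example):

```python
import hashlib

def minhash_signature(s, n=3, k=4):
    """Keep only the K smallest n-gram hashes; similar strings tend to
    share the same small hashes, so their signatures overlap."""
    hashes = sorted(
        int.from_bytes(hashlib.md5(s[i:i + n].encode()).digest()[:8], "big")
        for i in range(len(s) - n + 1)
    )
    return set(hashes[:k])

sig1 = minhash_signature("Hello world")
sig2 = minhash_signature("Hello world!")
print(len(sig1 & sig2))  # overlap of the two signatures
```

The index from before would then be built over these K hashes per word instead of over all n-grams.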

P.S. This presentation seems like a good introduction to your problem.

Solution 3 - Algorithm

This isn't possible, in general, because the set of edit distances between strings forms a metric space, but not one with a fixed dimension. That means that you can't provide a mapping between strings and integers that preserves a distance measure between them.

For example, you cannot assign numbers to these three phrases:

  • one two
  • one six
  • two six

Such that the numbers reflect the difference between all three phrases.
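A concrete way to see the obstruction: three strings at equal pairwise edit distance cannot be placed on a number line, because on a line one of the three gaps must equal the sum of the other two. A quick check (the strings "abc"/"abd"/"abe" are chosen just for this demonstration):

```python
from itertools import permutations

def levenshtein(a, b):
    """Plain edit distance (unit costs), row-by-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

words = ["abc", "abd", "abe"]
# All three pairwise distances are 1 ...
print([levenshtein(a, b) for a in words for b in words if a < b])

# ... but no three integers can be pairwise exactly 1 apart,
# since |x - z| = |(x - y) + (y - z)| must be 0 or 2.
ok = any(
    abs(x - y) == abs(y - z) == abs(x - z) == 1
    for x, y, z in permutations(range(-5, 6), 3)
)
print(ok)
```
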

Solution 4 - Algorithm

While the idea seems extremely sweet... I've never heard of this.

I've read many, many techniques, theses, and scientific papers on the subject of spell correction / typo correction, and the fastest proposals revolve around an index and the Levenshtein distance.

There are fairly elaborate techniques; the one I am currently working on combines:

  • A Burst Trie, with level compactness
  • A Levenshtein Automaton

Even though this doesn't mean it is "impossible" to get a score, I somehow think there would not be so much recent research on string comparison if such a "scoring" method had proved efficient.

If you ever find such a method, I am extremely interested :)

Solution 5 - Algorithm

Would Levenshtein distance work for you?

Solution 6 - Algorithm

In an unbounded problem, there is no solution which can convert any possible sequence of words, or any possible sequence of characters to a single number which describes locality.

Imagine similarity at the character level

stops
spots

hello world
world hello

In both examples the messages are different, but the characters in each pair are identical, so the measure would need to encode a position value as well as a character value. (char 0 == 'h', char 1 == 'e' ...)

Then compare the following similar messages

hello world
ello world

Although the two strings are similar, they could differ at the beginning, or at the end, which makes scaling by position problematic.

In the case of

spots
stops

The words only differ by position of the characters, so some form of position is important.

If the following strings are similar

 yesssssssssssssss
 yessssssssssssss

Then you have a form of paradox. If you add two 's' characters to the second string, it remains just as close to the first string, yet it must receive a distinct score. This can be repeated with progressively longer strings, all of which need to be close to the strings just shorter and just longer than themselves. I can't see how a single number can achieve this.

In general this is treated as a multi-dimensional problem - breaking the string into a vector

[ 'h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd' ]

But the values of the vector cannot:

  • be represented by a fixed-size number, or
  • give a good-quality difference measure.

If the number of words or the length of strings were bounded, then a coding solution might be possible.

Bounded values

Using something like arithmetic compression, then a sequence of words can be converted into a floating point number which represents the sequence. However this would treat items earlier in the sequence as more significant than the last item in the sequence.

Data mining solution

If you accept that the problem is high-dimensional, then you can store your strings in a metric tree (see Wikipedia: metric tree). This limits your search space, while not providing your "single number" solution.

I have code for this on GitHub: clustering

Items which are close together should be stored together in a part of the tree, but there is really no guarantee. The radius of subtrees is used to prune the search space.
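As one concrete metric tree, a BK-tree keyed by edit distance uses the triangle inequality to prune whole subtrees during lookup. A minimal sketch (this is not the linked clustering code, just an illustration):

```python
def levenshtein(a, b):
    """Plain edit distance (unit costs), row-by-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree:
    """Metric tree over edit distance: each child hangs off its parent
    at the key equal to their mutual distance."""
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})
        for w in it:
            self._add(w)

    def _add(self, word):
        node, children = self.root
        while True:
            d = levenshtein(word, node)
            if d == 0:
                return
            if d not in children:
                children[d] = (word, {})
                return
            node, children = children[d]

    def query(self, word, radius):
        """All stored words within `radius` edits; the triangle inequality
        means only children with key in [d - radius, d + radius] can match."""
        results, stack = [], [self.root]
        while stack:
            node, children = stack.pop()
            d = levenshtein(word, node)
            if d <= radius:
                results.append(node)
            for dist, child in children.items():
                if d - radius <= dist <= d + radius:
                    stack.append(child)
        return results

tree = BKTree(["hello world", "hello earth", "foo bar"])
print(tree.query("hello world!", 2))
```
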

Edit Distance or Levenshtein distance

This is used in an SQLite extension to perform similarity searching. There is no single-number solution, but it works out how many edits change one string into another; this yields a score which shows similarity.

Solution 7 - Algorithm

Your idea sounds like ontology, but applied to whole phrases. The more similar two phrases are, the closer in the graph they are (assuming you're using weighted edges), and vice versa: dissimilar phrases are very far from each other.

Another approach is to use a Fourier transform to get a sort of 'index' for a given string (it won't be a single number, though). You may find a little bit more in this paper.

And another idea, based on the Levenshtein distance: you may compare n-grams, which gives a similarity index for two given phrases; the more similar they are, the closer the value is to 1. This may be used to calculate distance in the graph. I wrote a paper on this a few years ago; if you'd like, I can share it.
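That n-gram similarity index can be sketched as a Jaccard coefficient on n-gram sets (bigrams here, chosen arbitrarily; the paper's exact formula may differ):

```python
def ngrams(s, n=2):
    """Set of character n-grams of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=2):
    """Jaccard coefficient on n-gram sets: 1.0 for identical sets,
    approaching 0 for unrelated strings."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

print(ngram_similarity("Hello world", "Hello world!"))  # close to 1
print(ngram_similarity("Hello world", "Foo bar"))       # close to 0
```
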

Anyway: although I don't know the exact solution, I'm also interested in what you'll come up with.

Solution 8 - Algorithm

Maybe use PCA, where the matrix is a list of the differences between the string and a fixed alphabet (à la ABCDEFGHI...). The answer could be simply the length of the principal component.

Just an idea.

ready-to-run PCA in C#

Solution 9 - Algorithm

I think of something like this:

  1. remove all non-word characters
  2. apply soundex
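A simplified sketch of step 2 (Soundex) follows; it implements the common coding rules but is an illustration, not a reference implementation:

```python
def soundex(word):
    """Simplified Soundex: keep the first letter, encode the rest as
    digits, skip vowels, collapse adjacent equal codes, pad to 4 chars.
    The letters h and w do not reset the previous code."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":
            prev = code
    return (result + "000")[:4]

print(soundex("Robert"))   # similar-sounding names collide:
print(soundex("Rupert"))   # both encode to the same 4-char code
```

The resulting 4-character code is a crude "locality-preserving hash" for pronunciation, which is exactly the spirit of this suggestion.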

Solution 10 - Algorithm

It is unlikely that one can derive a rather small number from each of two phrases such that comparing the numbers provides a relevant indication of the similarity of the original phrases.
The reason is that the number gives an indication in one dimension, while phrases vary in two dimensions: length and intensity.

The number could vary in length as well as in intensity, but I'm not sure that would help a lot.

In two dimensions, you would be better off looking at a matrix, in which some properties, like the determinant (a kind of derivative of the matrix), could give a rough idea of the phrase's trend.

Solution 11 - Algorithm

In Natural Language Processing we have a thing called Minimum Edit Distance (also known as Levenshtein Distance).
It is basically defined as the smallest number of operations needed to transform string1 into string2.
The operations are insertion, deletion, and substitution; each operation is given a cost, which is added to the distance.
The idea for solving your problem is to calculate the MED from your chosen string to all the other strings, sort that collection, and pick out the n strings with the smallest distance.
For example:

{"Hello World", "Hello World!", "Hello Earth"}
Choosing base-string="Hello World"  
Med(base-string, "Hello World!") = 1  
Med(base-string, "Hello Earth") = 8  
1st closest string is "Hello World!"

This effectively gives a score to each string in your collection.
C# implementation (insertion = 1, deletion = 1, substitution = 2):

public static int Distance(string s1, string s2)
{
    // matrix[i, j] holds the edit distance between the first i
    // characters of s1 and the first j characters of s2.
    int[,] matrix = new int[s1.Length + 1, s2.Length + 1];

    // Base cases: transforming to or from the empty string.
    for (int i = 0; i <= s1.Length; i++)
        matrix[i, 0] = i;
    for (int i = 0; i <= s2.Length; i++)
        matrix[0, i] = i;

    for (int i = 1; i <= s1.Length; i++)
    {
        for (int j = 1; j <= s2.Length; j++)
        {
            int value1 = matrix[i - 1, j] + 1;     // deletion
            int value2 = matrix[i, j - 1] + 1;     // insertion
            int value3 = matrix[i - 1, j - 1] + ((s1[i - 1] == s2[j - 1]) ? 0 : 2); // substitution

            matrix[i, j] = Math.Min(value1, Math.Min(value2, value3));
        }
    }

    return matrix[s1.Length, s2.Length];
}

Complexity: O(n × m), where n and m are the lengths of the two strings.
More info on Minimum Edit Distance can be found here

Solution 12 - Algorithm

Well, you could add up the ASCII value of each character and then compare the scores, setting a maximum amount by which they can differ. This does not guarantee that they will be similar, however, for the same reason two different strings can have the same hash value.

You could of course make a more complex function, starting by checking the size of the strings and then comparing each character one by one, again with a maximum difference set up.
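The collision problem with summing character codes is easy to see: any two anagrams get identical scores. A quick illustration:

```python
def ascii_score(s):
    """Sum of character codes: cheap, but order-blind, so anagrams
    (and many unrelated strings) collide."""
    return sum(map(ord, s))

print(ascii_score("listen"), ascii_score("silent"))          # identical scores
print(ascii_score("Hello world"), ascii_score("Hello world!"))  # differ by ord('!')
```
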

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type            | Original Author | Original Content on Stackoverflow
Question                | Josef Sábl      | View Question on Stackoverflow
Solution 1 - Algorithm  | DougW           | View Answer on Stackoverflow
Solution 2 - Algorithm  | gudok           | View Answer on Stackoverflow
Solution 3 - Algorithm  | Nick Johnson    | View Answer on Stackoverflow
Solution 4 - Algorithm  | Matthieu M.     | View Answer on Stackoverflow
Solution 5 - Algorithm  | Karl Knechtel   | View Answer on Stackoverflow
Solution 6 - Algorithm  | mksteve         | View Answer on Stackoverflow
Solution 7 - Algorithm  | Przemek Kryger  | View Answer on Stackoverflow
Solution 8 - Algorithm  | smirkingman     | View Answer on Stackoverflow
Solution 9 - Algorithm  | alpha-mouse     | View Answer on Stackoverflow
Solution 10 - Algorithm | Déjà vu         | View Answer on Stackoverflow
Solution 11 - Algorithm | rocketspacer    | View Answer on Stackoverflow
Solution 12 - Algorithm | Oscar Olim      | View Answer on Stackoverflow