What are some algorithms for comparing how similar two strings are?

AlgorithmLanguage AgnosticString ComparisonStdstringHeuristics

Algorithm Problem Overview


I need to compare strings to decide whether they represent the same thing. This relates to case titles entered by humans where abbreviations and other small details may differ. For example, consider the following two titles:

std::string first = "Henry C. Harper v. The Law Offices of Huey & Luey, LLP";

As opposed to:

std::string second = "Harper v. The Law Offices of Huey & Luey, LLP";

A human can quickly gauge that these are most likely one and the same. The current approach I have taken is to normalize the strings by lowercasing all letters and removing all punctuation and spaces giving:

std::string firstNormalized = "henrycharpervthelawofficesofhueylueyllp";

And:

std::string secondNormalized = "harpervthelawofficesofhueylueyllp";

Comparing in this case, one is a sub-sequence of the other, but you can imagine other more complex variations where that does not necessarily occur, yet they have significant sub-sequences in common. There could also be occasional human entry errors such as transposed letters and spelling errors.

Perhaps some kind of character diff program could help? I've seen good line diff programs for comparing differences in code to be checked in, is there something like that on a character basis, maybe in boost? If you could count the number of consecutive characters in common and take the ratio to the characters unshared, perhaps that would be a good heuristic?

In the end, I need a Boolean decision as to whether to consider them the same or not. It doesn't have to be perfect, but it should ideally rarely be wrong.

What algorithm can I use that will give me some kind of quantification as to how similar the two strings are to each other which I can then convert into a yes/no answer by way of some heuristic?

Algorithm Solutions


Solution 1 - Algorithm

What you're looking for are called String Metric algorithms. There a significant number of them, many with similar characteristics. Among the more popular:

  • Levenshtein Distance : The minimum number of single-character edits required to change one word into the other. Strings do not have to be the same length
  • Hamming Distance : The number of characters that are different in two equal length strings.
  • Smith–Waterman : A family of algorithms for computing variable sub-sequence similarities.
  • Sørensen–Dice Coefficient : A similarity algorithm that computes difference coefficients of adjacent character pairs.

Have a look at these as well as others on the wiki page on the topic.

Solution 2 - Algorithm

Damerau Levenshtein distance is another algorithm for comparing two strings and it is similar to the Levenshtein distance algorithm. The difference between the two is that it can also check transpositions between characters and hence may give a better result for error correction.

For example: The Levenshtein distance between night and nigth is 2 but Damerau Levenshtein distance between night and nigth will be 1 because it is just a swap of a pair of characters.

Solution 3 - Algorithm

You could use ngrams for that. For example, transform the two strings in word trigrams (usually lowercase) and compare the percentage of them that are equal to one another.

Your challenge is to define a minimum percentage for similarity.

http://en.wikipedia.org/wiki/N-gram

Solution 4 - Algorithm

Another algorithm that you can consider is the Simon White Similarity:

def get_bigrams(string):
    """
    Take a string and return a list of bigrams.
    """
    if string is None:
        return ""

    s = string.lower()
    return [s[i : i + 2] for i in list(range(len(s) - 1))]
def simon_similarity(str1, str2):
    """
    Perform bigram comparison between two strings
    and return a percentage match in decimal form.
    """
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union = len(pairs1) + len(pairs2)

    if union == 0 or union is None:
        return 0

    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                break
    return (2.0 * hit_count) / union

Solution 5 - Algorithm

You may use the algorithm of computing the length of the longest common sub-sequence to solve the problem. If the length of the longest common sub-sequence for both the input strings is less than the length of either of the strings, they are unequal.

You may use the approach of dynamic programming to solve the problem and optimize the space complexity as well in case you don't wish to figure out the longest common sub-sequence.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionWilliamKFView Question on Stackoverflow
Solution 1 - AlgorithmDaniel FreyView Answer on Stackoverflow
Solution 2 - AlgorithmAnkit ChaurasiaView Answer on Stackoverflow
Solution 3 - AlgorithmnodermanView Answer on Stackoverflow
Solution 4 - AlgorithmAdithya BharadwajView Answer on Stackoverflow
Solution 5 - Algorithmnmg_vikasView Answer on Stackoverflow