String similarity metrics in Python

Tags: Python, Algorithm, String, Levenshtein Distance

Python Problem Overview


I want to find the string similarity between two strings. This page has examples of some of them. Python has an implementation of the Levenshtein algorithm. Is there a better algorithm (and hopefully a Python library) under these constraints?

  1. I want to do fuzzy matches between strings, e.g. matches('Hello, All you people', 'hello, all You peopl') should return True.
  2. False negatives are acceptable; false positives, except in extremely rare cases, are not.
  3. This is done in a non-realtime setting, so speed is not (much of) a concern.
  4. [Edit] I am comparing multi-word strings.

Would something other than Levenshtein distance(or Levenshtein ratio) be a better algorithm for my case?

Python Solutions


Solution 1 - Python

I realize it's not the same thing, but this is close enough:

>>> import difflib
>>> a = 'Hello, All you people'
>>> b = 'hello, all You peopl'
>>> seq = difflib.SequenceMatcher(a=a.lower(), b=b.lower())
>>> seq.ratio()
0.97560975609756095

You can wrap this in a function:

def similar(seq1, seq2):
    return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9

>>> similar(a, b)
True
>>> similar('Hello, world', 'Hi, world')
False

Solution 2 - Python

There's a great resource for string similarity metrics at the University of Sheffield. It lists various metrics (beyond just Levenshtein) and has open-source implementations of them. Many of them should be easy to adapt into Python; one is sketched after the list below.

http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

Here's a bit of the list:

  • Hamming distance
  • Levenshtein distance
  • Needleman-Wunsch distance or Sellers algorithm
  • and many more...
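
To give a sense of how small these metrics are, here is a minimal, unoptimized sketch of the classic dynamic-programming Levenshtein distance in pure Python (the function name is mine, not from the Sheffield page):

def levenshtein(s, t):
    # previous[j] = distance between s[:i-1] and t[:j] while processing row i
    previous = list(range(len(t) + 1))
    for i, sc in enumerate(s, start=1):
        current = [i]
        for j, tc in enumerate(t, start=1):
            cost = 0 if sc == tc else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

>>> levenshtein('ape', 'apple')
2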

Solution 3 - Python

This snippet calculates the difflib, Levenshtein, Sørensen, and Jaccard similarity values for two strings. In the snippet below, I iterate over a TSV in which the strings of interest occupy columns [3] and [4]. (First run pip install python-Levenshtein and pip install distance.)

import codecs, difflib, Levenshtein, distance

with codecs.open("titles.tsv", "r", "utf-8") as f:
    title_list = f.read().split("\n")[:-1]

    for row in title_list:
        # the strings of interest sit in columns 3 and 4 of the TSV
        sr = row.lower().split("\t")

        diffl = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
        lev   = Levenshtein.ratio(sr[3], sr[4])
        sor   = 1 - distance.sorensen(sr[3], sr[4])
        jac   = 1 - distance.jaccard(sr[3], sr[4])

        print(diffl, lev, sor, jac)
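
For a quick sanity check without a TSV, the same four metrics can be run directly on the question's example pair (this usage is my own illustration, assuming the same two libraries are installed):

import difflib, Levenshtein, distance

a, b = 'hello, all you people', 'hello, all you peopl'
print(difflib.SequenceMatcher(None, a, b).ratio())  # roughly 0.98
print(Levenshtein.ratio(a, b))                      # close to 1.0
print(1 - distance.sorensen(a, b))
print(1 - distance.jaccard(a, b))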

Solution 4 - Python

I would use Levenshtein distance, or the so-called Damerau-Levenshtein distance (which takes transpositions into account), rather than the difflib stuff, for two reasons: (1) "fast enough" (dynamic-programming algorithm) and "whoooosh" (bit-bashing) C code is available, and (2) well-understood behaviour, e.g. Levenshtein satisfies the triangle inequality and thus can be used in, say, a Burkhard-Keller tree.

Threshold: you should treat as "positive" only those cases where distance < (1 - X) * max(len(string1), len(string2)), and adjust X (the similarity factor) to suit yourself. One way of choosing X is to get a sample of matches, calculate X for each, ignore cases where X < (say) 0.8 or 0.9, then sort the remainder in descending order of X, eyeball them, insert the correct result, and calculate some cost-of-mistakes measure for various levels of X.
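
Here is a minimal sketch of that rule, assuming the python-Levenshtein package (the function name and the default X are mine):

import Levenshtein

def matches(s1, s2, x=0.8):
    # positive only when distance < (1 - X) * max(len(s1), len(s2))
    return Levenshtein.distance(s1.lower(), s2.lower()) < (1 - x) * max(len(s1), len(s2))

>>> matches('Hello, All you people', 'hello, all You peopl')
True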

N.B. Your ape/apple example has distance 2, so X is 1 - 2/5 = 0.6. I would only use a threshold as low as 0.75 if I were desperately looking for something and had a high false-negative penalty.

Solution 5 - Python

Is that what you mean?

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']

See http://docs.python.org/library/difflib.html#difflib.get_close_matches
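
Given the constraint that false positives are almost never acceptable, note that get_close_matches also takes a cutoff argument (default 0.6); raising it tightens matching. A small illustration, with a cutoff value of my own choosing:

>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'], cutoff=0.8)
['apple']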

Solution 6 - Python

I know this isn't the same, but you can adjust the ratio to filter out strings that are not similar enough and return the closest match to the string you are looking for.

Perhaps you would be more interested in semantic similarity metrics.

https://www.google.com/search?client=ubuntu&channel=fs&q=semantic+similarity+string+match&ie=utf-8&oe=utf-8

I realize you said speed is not an issue, but if you are processing a lot of strings for your algorithm, the code below is very helpful.

import Levenshtein

def spellcheck(sentence, wordlist):
    # difflib alternative (slower): ' '.join(difflib.get_close_matches(word, wordlist, 1, 0)[0] for word in sentence.split())
    # pick the wordlist entry with the highest Levenshtein ratio for each word
    return ' '.join(max(wordlist, key=lambda x: Levenshtein.ratio(x, word))
                    for word in sentence.split())
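
For example, with a toy vocabulary (wordlist here is just an illustrative stand-in for a real word list):

>>> wordlist = ['hello', 'all', 'you', 'people']
>>> spellcheck('hallo al yuo peopl', wordlist)
'hello all you people'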

It's about 20 times faster than difflib.

https://pypi.python.org/pypi/python-Levenshtein/

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

  • Question: agiliq (View Question on Stackoverflow)
  • Solution 1: Nadia Alramli (View Answer on Stackoverflow)
  • Solution 2: ariddell (View Answer on Stackoverflow)
  • Solution 3: duhaime (View Answer on Stackoverflow)
  • Solution 4: John Machin (View Answer on Stackoverflow)
  • Solution 5: Tzury Bar Yochay (View Answer on Stackoverflow)
  • Solution 6: John (View Answer on Stackoverflow)