Similarity String Comparison in Java
JavaString ComparisonJava Problem Overview
I want to compare several strings to each other, and find the ones that are the most similar. I was wondering if there is any library, method or best practice that would return me which strings are more similar to other strings. For example:
- "The quick fox jumped" -> "The fox jumped"
- "The quick fox jumped" -> "The fox"
This comparison would return that the first is more similar than the second.
I guess I need some method such as:
double similarityIndex(String s1, String s2)
Is there such a thing somewhere?
EDIT: Why am I doing this? I am writing a script that compares the output of a MS Project file to the output of some legacy system that handles tasks. Because the legacy system has a very limited field width, when the values are added the descriptions are abbreviated. I want some semi-automated way to find which entries from MS Project are similar to the entries on the system so I can get the generated keys. It has drawbacks, as it has to be still manually checked, but it would save a lot of work
Java Solutions
Solution 1 - Java
The common way of calculating the similarity between two strings in a 0%-100% fashion, as used in many libraries, is to measure how much (in %) you'd have to change the longer string to turn it into the shorter:
/**
* Calculates the similarity (a number within 0 and 1) between two strings.
*/
public static double similarity(String s1, String s2) {
String longer = s1, shorter = s2;
if (s1.length() < s2.length()) { // longer should always have greater length
longer = s2; shorter = s1;
}
int longerLength = longer.length();
if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}
// you can use StringUtils.getLevenshteinDistance() as the editDistance() function
// full copy-paste working code is below
###Computing the editDistance()
:
The editDistance()
function above is expected to calculate the edit distance between the two strings. There are several implementations to this step, each may suit a specific scenario better. The most common is the Levenshtein distance algorithm and we'll use it in our example below (for very large strings, other algorithms are likely to perform better).
Here's two options to calculate the edit distance:
- You can use Apache Commons Text's implementation of Levenshtein distance:
apply(CharSequence left, CharSequence rightt)
- Implement it in your own. Below you'll find an example implementation.
###Working example:
public class StringSimilarity {
/**
* Calculates the similarity (a number within 0 and 1) between two strings.
*/
public static double similarity(String s1, String s2) {
String longer = s1, shorter = s2;
if (s1.length() < s2.length()) { // longer should always have greater length
longer = s2; shorter = s1;
}
int longerLength = longer.length();
if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
/* // If you have Apache Commons Text, you can use it to calculate the edit distance:
LevenshteinDistance levenshteinDistance = new LevenshteinDistance();
return (longerLength - levenshteinDistance.apply(longer, shorter)) / (double) longerLength; */
return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}
// Example implementation of the Levenshtein Edit Distance
// See http://rosettacode.org/wiki/Levenshtein_distance#Java
public static int editDistance(String s1, String s2) {
s1 = s1.toLowerCase();
s2 = s2.toLowerCase();
int[] costs = new int[s2.length() + 1];
for (int i = 0; i <= s1.length(); i++) {
int lastValue = i;
for (int j = 0; j <= s2.length(); j++) {
if (i == 0)
costs[j] = j;
else {
if (j > 0) {
int newValue = costs[j - 1];
if (s1.charAt(i - 1) != s2.charAt(j - 1))
newValue = Math.min(Math.min(newValue, lastValue),
costs[j]) + 1;
costs[j - 1] = lastValue;
lastValue = newValue;
}
}
}
if (i > 0)
costs[s2.length()] = lastValue;
}
return costs[s2.length()];
}
public static void printSimilarity(String s, String t) {
System.out.println(String.format(
"%.3f is the similarity between \"%s\" and \"%s\"", similarity(s, t), s, t));
}
public static void main(String[] args) {
printSimilarity("", "");
printSimilarity("1234567890", "1");
printSimilarity("1234567890", "123");
printSimilarity("1234567890", "1234567");
printSimilarity("1234567890", "1234567890");
printSimilarity("1234567890", "1234567980");
printSimilarity("47/2010", "472010");
printSimilarity("47/2010", "472011");
printSimilarity("47/2010", "AB.CDEF");
printSimilarity("47/2010", "4B.CDEFG");
printSimilarity("47/2010", "AB.CDEFG");
printSimilarity("The quick fox jumped", "The fox jumped");
printSimilarity("The quick fox jumped", "The fox");
printSimilarity("kitten", "sitting");
}
}
Output:
1.000 is the similarity between "" and ""
0.100 is the similarity between "1234567890" and "1"
0.300 is the similarity between "1234567890" and "123"
0.700 is the similarity between "1234567890" and "1234567"
1.000 is the similarity between "1234567890" and "1234567890"
0.800 is the similarity between "1234567890" and "1234567980"
0.857 is the similarity between "47/2010" and "472010"
0.714 is the similarity between "47/2010" and "472011"
0.000 is the similarity between "47/2010" and "AB.CDEF"
0.125 is the similarity between "47/2010" and "4B.CDEFG"
0.000 is the similarity between "47/2010" and "AB.CDEFG"
0.700 is the similarity between "The quick fox jumped" and "The fox jumped"
0.350 is the similarity between "The quick fox jumped" and "The fox"
0.571 is the similarity between "kitten" and "sitting"
Solution 2 - Java
Yes, there are many well documented algorithms like:
- Cosine similarity
- Jaccard similarity
- Dice's coefficient
- Matching similarity
- Overlap similarity
- etc etc
A good summary ("Sam's String Metrics") can be found here (original link dead, so it links to Internet Archive)
Also check these projects:
Solution 3 - Java
I translated the Levenshtein distance algorithm into JavaScript:
String.prototype.LevenshteinDistance = function (s2) {
var array = new Array(this.length + 1);
for (var i = 0; i < this.length + 1; i++)
array[i] = new Array(s2.length + 1);
for (var i = 0; i < this.length + 1; i++)
array[i][0] = i;
for (var j = 0; j < s2.length + 1; j++)
array[0][j] = j;
for (var i = 1; i < this.length + 1; i++) {
for (var j = 1; j < s2.length + 1; j++) {
if (this[i - 1] == s2[j - 1]) array[i][j] = array[i - 1][j - 1];
else {
array[i][j] = Math.min(array[i][j - 1] + 1, array[i - 1][j] + 1);
array[i][j] = Math.min(array[i][j], array[i - 1][j - 1] + 1);
}
}
}
return array[this.length][s2.length];
};
Solution 4 - Java
There are indeed a lot of string similarity measures out there:
- Levenshtein edit distance;
- Damerau-Levenshtein distance;
- Jaro-Winkler similarity;
- Longest Common Subsequence edit distance;
- Q-Gram (Ukkonen);
- n-Gram distance (Kondrak);
- Jaccard index;
- Sorensen-Dice coefficient;
- Cosine similarity;
- ...
You can find explanation and java implementation of these here: https://github.com/tdebatty/java-string-similarity
Solution 5 - Java
You could use Levenshtein distance to calculate the difference between two strings. http://en.wikipedia.org/wiki/Levenshtein_distance
Solution 6 - Java
You can achieve this using the apache commons java library. Take a look at these two functions within it:
Solution 7 - Java
Thank to the first answerer, I think there are 2 calculations of computeEditDistance(s1, s2). Due to high time spending of it, decided to improve the code's performance. So:
public class LevenshteinDistance {
public static int computeEditDistance(String s1, String s2) {
s1 = s1.toLowerCase();
s2 = s2.toLowerCase();
int[] costs = new int[s2.length() + 1];
for (int i = 0; i <= s1.length(); i++) {
int lastValue = i;
for (int j = 0; j <= s2.length(); j++) {
if (i == 0) {
costs[j] = j;
} else {
if (j > 0) {
int newValue = costs[j - 1];
if (s1.charAt(i - 1) != s2.charAt(j - 1)) {
newValue = Math.min(Math.min(newValue, lastValue),
costs[j]) + 1;
}
costs[j - 1] = lastValue;
lastValue = newValue;
}
}
}
if (i > 0) {
costs[s2.length()] = lastValue;
}
}
return costs[s2.length()];
}
public static void printDistance(String s1, String s2) {
double similarityOfStrings = 0.0;
int editDistance = 0;
if (s1.length() < s2.length()) { // s1 should always be bigger
String swap = s1;
s1 = s2;
s2 = swap;
}
int bigLen = s1.length();
editDistance = computeEditDistance(s1, s2);
if (bigLen == 0) {
similarityOfStrings = 1.0; /* both strings are zero length */
} else {
similarityOfStrings = (bigLen - editDistance) / (double) bigLen;
}
//////////////////////////
//System.out.println(s1 + "-->" + s2 + ": " +
// editDistance + " (" + similarityOfStrings + ")");
System.out.println(editDistance + " (" + similarityOfStrings + ")");
}
public static void main(String[] args) {
printDistance("", "");
printDistance("1234567890", "1");
printDistance("1234567890", "12");
printDistance("1234567890", "123");
printDistance("1234567890", "1234");
printDistance("1234567890", "12345");
printDistance("1234567890", "123456");
printDistance("1234567890", "1234567");
printDistance("1234567890", "12345678");
printDistance("1234567890", "123456789");
printDistance("1234567890", "1234567890");
printDistance("1234567890", "1234567980");
printDistance("47/2010", "472010");
printDistance("47/2010", "472011");
printDistance("47/2010", "AB.CDEF");
printDistance("47/2010", "4B.CDEFG");
printDistance("47/2010", "AB.CDEFG");
printDistance("The quick fox jumped", "The fox jumped");
printDistance("The quick fox jumped", "The fox");
printDistance("The quick fox jumped",
"The quick fox jumped off the balcany");
printDistance("kitten", "sitting");
printDistance("rosettacode", "raisethysword");
printDistance(new StringBuilder("rosettacode").reverse().toString(),
new StringBuilder("raisethysword").reverse().toString());
for (int i = 1; i < args.length; i += 2) {
printDistance(args[i - 1], args[i]);
}
}
}
Solution 8 - Java
Theoretically, you can compare edit distances.
Solution 9 - Java
This is typically done using an edit distance measure. Searching for "edit distance java" turns up a number of libraries, like this one.
Solution 10 - Java
Sounds like a plagiarism finder to me if your string turns into a document. Maybe searching with that term will turn up something good.
"Programming Collective Intelligence" has a chapter on determining whether two documents are similar. The code is in Python, but it's clean and easy to port.
Solution 11 - Java
You can also use z algorithm to find similarity in the string. Click here https://teakrunch.com/2020/05/09/string-similarity-hackerrank-challenge/