Can someone give an example of cosine similarity, in a very simple, graphical way?

Tags: Text, Data Mining, Cosine Similarity

Text Problem Overview


Cosine Similarity article on Wikipedia

Can you show the vectors here (in a list or something) and then do the math, and let us see how it works?

Text Solutions


Solution 1 - Text

Here are two very short texts to compare:

  1. Julie loves me more than Linda loves me

  2. Jane likes me more than Julie loves me

We want to know how similar these texts are, purely in terms of word counts (and ignoring word order). We begin by making a list of the words from both texts:

me Julie loves Linda than more likes Jane

Now we count the number of times each of these words appears in each text:

        Text 1   Text 2
   me      2        2
 Jane      0        1
Julie      1        1
Linda      1        0
likes      0        1
loves      2        1
 more      1        1
 than      1        1

We are not interested in the words themselves though. We are interested only in those two vertical vectors of counts. For instance, there are two instances of 'me' in each text. We are going to decide how close these two texts are to each other by calculating one function of those two vectors, namely the cosine of the angle between them.

The two vectors are, again:

a: [2, 0, 1, 1, 0, 2, 1, 1]

b: [2, 1, 1, 0, 1, 1, 1, 1]

The cosine of the angle between them is about 0.822.

These vectors are 8-dimensional. A virtue of cosine similarity is that it reduces a question we cannot visualise (how far apart are two points in 8-dimensional space?) to one we can: the angle between the vectors. Here that angle is about 35 degrees, which is some 'distance' from zero, i.e. from perfect agreement.
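
For anyone who wants to check the arithmetic, here is a minimal Python sketch (my addition, not part of the original answer) that computes the cosine and the corresponding angle for those two count vectors:

import math

a = [2, 0, 1, 1, 0, 2, 1, 1]
b = [2, 1, 1, 0, 1, 1, 1, 1]

# cosine = (a . b) / (|a| * |b|)
dot = sum(x * y for x, y in zip(a, b))            # 9
norm_a = math.sqrt(sum(x * x for x in a))         # sqrt(12)
norm_b = math.sqrt(sum(x * x for x in b))         # sqrt(10)

cosine = dot / (norm_a * norm_b)
print(cosine)                            # ~0.822
print(math.degrees(math.acos(cosine)))   # ~34.8 degrees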

Solution 2 - Text

I'm guessing you are more interested in getting some insight into "why" the cosine similarity works (why it provides a good indication of similarity), rather than "how" it is calculated (the specific operations used for the calculation). If your interest is in the latter, see the reference indicated by Daniel in this post, as well as a related SO Question.

To explain both the how and even more so the why, it is useful, at first, to simplify the problem and to work only in two dimensions. Once you get this in 2D, it is easier to think of it in three dimensions, and of course harder to imagine in many more dimensions, but by then we can use linear algebra to do the numeric calculations and also to help us think in terms of lines / vectors / "planes" / "spheres" in n dimensions, even though we can't draw these.

So, in two dimensions: with regard to text similarity this means that we would focus on two distinct terms, say the words "London" and "Paris", and we'd count how many times each of these words is found in each of the two documents we wish to compare. This gives us, for each document, a point in the x-y plane. For example, if Doc1 had Paris once and London four times, a point at (1,4) would represent this document (with regard to this diminutive evaluation of documents). Or, speaking in terms of vectors, this Doc1 document would be an arrow going from the origin to point (1,4). With this image in mind, let's think about what it means for two documents to be similar and how this relates to the vectors.

VERY similar documents (again with regards to this limited set of dimensions) would have the very same number of references to Paris, AND the very same number of references to London, or maybe, they could have the same ratio of these references. A Document, Doc2, with 2 refs to Paris and 8 refs to London, would also be very similar, only with maybe a longer text or somehow more repetitive of the cities' names, but in the same proportion. Maybe both documents are guides about London, only making passing references to Paris (and how uncool that city is ;-) Just kidding!!!.

Now, less similar documents may also include references to both cities, but in different proportions. Maybe a third document would cite Paris only once and London seven times.

Back to our x-y plane, if we draw these hypothetical documents, we see that when they are VERY similar, their vectors overlap (though some vectors may be longer), and as they start to have less in common, these vectors start to diverge, to have a wider angle between them.

By measuring the angle between the vectors, we can get a good idea of their similarity, and, to make things even easier, by taking the cosine of this angle we get a nice value between 0 and 1 (or -1 and 1, depending on what we count and how) that indicates this similarity. The smaller the angle, the bigger (closer to 1) the cosine value, and the higher the similarity.

At the extreme, if Doc1 only cites Paris and Doc2 only cites London, the documents have absolutely nothing in common. Doc1 would have its vector on the x-axis, Doc2 on the y-axis, the angle 90 degrees, Cosine 0. In this case we'd say that these documents are orthogonal to one another.
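
To make the 2D picture concrete, here is a small sketch (my own addition, using the hypothetical Paris/London counts discussed above):

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

doc1 = (1, 4)   # Paris once, London four times
doc2 = (2, 8)   # same proportions, just a longer text
doc3 = (1, 7)   # both cities, but different proportions
doc4 = (3, 0)   # only Paris
doc5 = (0, 5)   # only London

print(cosine(doc1, doc2))  # 1.0   -> vectors overlap, maximally similar
print(cosine(doc1, doc3))  # ~0.99 -> still close, but the angle has opened up
print(cosine(doc4, doc5))  # 0.0   -> orthogonal, nothing in common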

Adding dimensions:
With this intuitive feel for similarity expressed as a small angle (or large cosine), we can now imagine things in 3 dimensions, say by bringing the word "Amsterdam" into the mix, and visualize quite well how a document with two references to each would have a vector going in a particular direction, and we can see how this direction would compare to a document citing Paris and London three times each, but not Amsterdam, etc. As said, we can try to imagine this fancy space for 10 or 100 cities. It's hard to draw, but easy to conceptualize.

I'll wrap up just by saying a few words about the formula itself. As I've said, other references provide good information about the calculations.

First, in two dimensions. The formula for the cosine of the angle between two vectors is derived from the trigonometric identity for the difference of two angles (angle a and angle b):

cos(a - b) = (cos(a) * cos(b)) + (sin (a) * sin(b))

This formula looks very similar to the dot product formula:

Vect1 . Vect2 =  (x1 * x2) + (y1 * y2)

where cos(a) corresponds to the x value and sin(a) to the y value, for the first vector, etc. The only problem is that x, y, etc. are not exactly the cos and sin values, because these values need to be read on the unit circle. That's where the denominator of the formula kicks in: by dividing by the product of the lengths of these vectors, the x and y coordinates become normalized.
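
And as a quick numerical check of that connection (again my own sketch, with arbitrarily chosen vectors): writing each 2D vector in terms of its angle, the difference identity gives the same number as the normalized dot product:

import math

v1 = (1, 4)
v2 = (2, 7)

angle1 = math.atan2(v1[1], v1[0])
angle2 = math.atan2(v2[1], v2[0])

# cosine of the angle between them, via the difference identity
via_identity = math.cos(angle1 - angle2)

# cosine via the dot product divided by the vector lengths
dot = v1[0] * v2[0] + v1[1] * v2[1]
via_dot_product = dot / (math.hypot(*v1) * math.hypot(*v2))

print(via_identity, via_dot_product)  # both ~0.9994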

Solution 3 - Text

Here's my implementation in C#.

using System;

namespace CosineSimilarity
{
    class Program
    {
        static void Main()
        {
            int[] vecA = {1, 2, 3, 4, 5};
            int[] vecB = {6, 7, 7, 9, 10};

            var cosSimilarity = CalculateCosineSimilarity(vecA, vecB);

            Console.WriteLine(cosSimilarity);
            Console.Read();
        }

        private static double CalculateCosineSimilarity(int[] vecA, int[] vecB)
        {
            var dotProduct = DotProduct(vecA, vecB);
            var magnitudeOfA = Magnitude(vecA);
            var magnitudeOfB = Magnitude(vecB);

            return dotProduct/(magnitudeOfA*magnitudeOfB);
        }

        private static double DotProduct(int[] vecA, int[] vecB)
        {
            // I'm not validating inputs here for simplicity.            
            double dotProduct = 0;
            for (var i = 0; i < vecA.Length; i++)
            {
                dotProduct += (vecA[i] * vecB[i]);
            }

            return dotProduct;
        }

        // Magnitude of the vector is the square root of the dot product of the vector with itself.
        private static double Magnitude(int[] vector)
        {
            return Math.Sqrt(DotProduct(vector, vector));
        }
    }
}

Solution 4 - Text

For simplicity, I am reducing the vectors a and b:

Let :
    a : [1, 1, 0]
    b : [1, 0, 1]

Then the cosine similarity, cos(Theta), is:

 cos(Theta) = (1*1 + 1*0 + 0*1) / (sqrt(1^2 + 1^2) * sqrt(1^2 + 1^2)) = 1/2 = 0.5

and the angle whose cosine is 0.5, i.e. arccos(0.5), is 60 degrees.
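
A quick check of that arithmetic in Python (my addition):

import math

a = [1, 1, 0]
b = [1, 0, 1]

cos_theta = sum(x * y for x, y in zip(a, b)) / (
    math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

print(cos_theta)                           # 0.5 (up to floating point)
print(math.degrees(math.acos(cos_theta)))  # 60.0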

Solution 5 - Text

This Python code is my quick and dirty attempt to implement the algorithm:

import math
from collections import Counter

def build_vector(iterable1, iterable2):
    counter1 = Counter(iterable1)
    counter2 = Counter(iterable2)
    all_items = set(counter1.keys()).union(set(counter2.keys()))
    vector1 = [counter1[k] for k in all_items]
    vector2 = [counter2[k] for k in all_items]
    return vector1, vector2

def cosim(v1, v2):
    dot_product = sum(n1 * n2 for n1, n2 in zip(v1, v2) )
    magnitude1 = math.sqrt(sum(n ** 2 for n in v1))
    magnitude2 = math.sqrt(sum(n ** 2 for n in v2))
    return dot_product / (magnitude1 * magnitude2)


l1 = "Julie loves me more than Linda loves me".split()
l2 = "Jane likes me more than Julie loves me or".split()


v1, v2 = build_vector(l1, l2)
print(cosim(v1, v2))
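
Note that l2 here ends with an extra "or" that is not in the texts from Solution 1, so the printed value (about 0.78) is slightly lower than the 0.822 computed above; drop the "or" and the two results agree.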

Solution 6 - Text

Using @Bill Bell's example, here are two ways to do this in R:

a = c(2,1,0,2,0,1,1,1)

b = c(2,1,1,1,1,0,1,1)

d = (a %*% b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

or, taking advantage of the performance of crossprod():

e = crossprod(a, b) / (sqrt(crossprod(a, a)) * sqrt(crossprod(b, b)))

Solution 7 - Text

This is simple Python (NumPy) code which implements cosine similarity.

import numpy as np

matrix = np.array([[2, 1, 0, 2, 0, 1, 1, 1],
                   [2, 1, 1, 1, 1, 0, 1, 1]])

similarity = np.dot(matrix[0], matrix[1]) / (np.linalg.norm(matrix[0]) * np.linalg.norm(matrix[1]))
print(similarity)
# output
# 0.82158384

Solution 8 - Text

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * @author Xiao Ma
 * mail : [email protected]
 */
public class SimilarityUtil {

public static double consineTextSimilarity(String[] left, String[] right) {
	Map<String, Integer> leftWordCountMap = new HashMap<String, Integer>();
	Map<String, Integer> rightWordCountMap = new HashMap<String, Integer>();
	Set<String> uniqueSet = new HashSet<String>();
	Integer temp = null;
	for (String leftWord : left) {
		temp = leftWordCountMap.get(leftWord);
		if (temp == null) {
			leftWordCountMap.put(leftWord, 1);
			uniqueSet.add(leftWord);
		} else {
			leftWordCountMap.put(leftWord, temp + 1);
		}
	}
	for (String rightWord : right) {
		temp = rightWordCountMap.get(rightWord);
		if (temp == null) {
			rightWordCountMap.put(rightWord, 1);
			uniqueSet.add(rightWord);
		} else {
			rightWordCountMap.put(rightWord, temp + 1);
		}
	}
	int[] leftVector = new int[uniqueSet.size()];
	int[] rightVector = new int[uniqueSet.size()];
	int index = 0;
	Integer tempCount = 0;
	for (String uniqueWord : uniqueSet) {
		tempCount = leftWordCountMap.get(uniqueWord);
		leftVector[index] = tempCount == null ? 0 : tempCount;
		tempCount = rightWordCountMap.get(uniqueWord);
		rightVector[index] = tempCount == null ? 0 : tempCount;
		index++;
	}
	return consineVectorSimilarity(leftVector, rightVector);
}

/**
 * The resulting similarity ranges from −1 meaning exactly opposite, to 1
 * meaning exactly the same, with 0 usually indicating independence, and
 * in-between values indicating intermediate similarity or dissimilarity.
 * 
 * For text matching, the attribute vectors A and B are usually the term
 * frequency vectors of the documents. The cosine similarity can be seen as
 * a method of normalizing document length during comparison.
 * 
 * In the case of information retrieval, the cosine similarity of two
 * documents will range from 0 to 1, since the term frequencies (tf-idf
 * weights) cannot be negative. The angle between two term frequency vectors
 * cannot be greater than 90°.
 * 
 * @param leftVector
 * @param rightVector
 * @return
 */
private static double consineVectorSimilarity(int[] leftVector,
		int[] rightVector) {
	if (leftVector.length != rightVector.length) {
		throw new IllegalArgumentException("Vectors must be of equal length");
	}
	double dotProduct = 0;
	double leftNorm = 0;
	double rightNorm = 0;
	for (int i = 0; i < leftVector.length; i++) {
		dotProduct += leftVector[i] * rightVector[i];
		leftNorm += leftVector[i] * leftVector[i];
		rightNorm += rightVector[i] * rightVector[i];
	}

	double result = dotProduct
			/ (Math.sqrt(leftNorm) * Math.sqrt(rightNorm));
	return result;
}

public static void main(String[] args) {
	String left[] = { "Julie", "loves", "me", "more", "than", "Linda",
			"loves", "me" };
	String right[] = { "Jane", "likes", "me", "more", "than", "Julie",
			"loves", "me" };
	System.out.println(consineTextSimilarity(left,right));
}
}

Solution 9 - Text

Simple Java code to calculate cosine similarity:

/**
   * Method to calculate cosine similarity of vectors
   * 1 - exactly similar (angle between them is 0)
   * 0 - orthogonal vectors (angle between them is 90)
   * @param vector1 - vector in the form [a1, a2, a3, ..... an]
   * @param vector2 - vector in the form [b1, b2, b3, ..... bn]
   * @return - the cosine similarity of vectors (ranges from 0 to 1)
   */
  private double cosineSimilarity(List<Double> vector1, List<Double> vector2) {

    double dotProduct = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (int i = 0; i < vector1.size(); i++) {
      dotProduct += vector1.get(i) * vector2.get(i);
      normA += Math.pow(vector1.get(i), 2);
      normB += Math.pow(vector2.get(i), 2);
    }
    return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
  }

Solution 10 - Text

Let me try explaining this in terms of Python code and some mathematical formulas.

Suppose we have two very short texts in our code:

texts = ["I am a boy", "I am a girl"] 

And we want to compare the following query text to see how close the query is to the texts above, using fast cosine similarity scores:

query = ["I am a boy scout"]

How should we compute the cosine similarity scores? First, let's build a tfidf matrix in Python for these texts:

from sklearn.feature_extraction.text import TfidfVectorizer
    
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(texts)

Next, let's check the values of our tfidf matrix and its vocabulary:

print(tfidf_matrix.toarray())
# output 
array([[0.57973867, 0.81480247, 0.        ],
       [0.57973867, 0.        , 0.81480247]])

Here, we get a 2 x 3 tfidf matrix: 2 documents/texts x 3 terms. This is our tfidf document-term matrix. Let's see what the 3 terms are by calling vectorizer.vocabulary_:

print(vectorizer.vocabulary_)
# output
{'am': 0, 'boy': 1, 'girl': 2}

This tells us that the 3 terms in our tfidf matrix are 'am', 'boy' and 'girl'. 'am' is at column 0, 'boy' is at column 1, and 'girl' is at column 2. The terms 'I' and 'a' have been dropped by the vectorizer because its default token pattern only keeps words of two or more characters.

Now we have our tfidf matrix, we want to compare our query text with our texts and see how close our query is to our texts. To do that, we can compute the cosine similarity scores of the query vs the tfidf matrix of the texts. But first, we need to compute the tfidf of our query:

query = ["I am a boy scout"]

query_tfidf = vectorizer.transform(query)
print(query_tfidf.toarray())
#output
array([[0.57973867, 0.81480247, 0.        ]])

Here, we computed the tfidf of our query. Our query_tfidf is a vector of tfidf values, [0.57973867, 0.81480247, 0.], which we will use to compute our cosine similarity scores. Note that vectorizer.transform(query) does not select a row from tfidf_matrix; it weights the query's own terms using the vocabulary and idf weights learned during fit_transform. The query contains 'am' and 'boy', which are in the vocabulary, and 'scout', which is not and is therefore ignored. Because the surviving terms and their counts are exactly those of document 1, the query vector comes out identical to row 1 of the tfidf_matrix, [0.57973867, 0.81480247, 0.].
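
To see that transform really weights the query's own term counts against the fitted vocabulary, rather than picking an existing row, here is a small check of my own: a query whose term counts match neither text produces a vector that matches neither row.

print(vectorizer.transform(["boy boy girl"]).toarray())
# output
# [[0.         0.89442719 0.4472136 ]]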

After computing our query_tfidf, we can now matrix multiply or dot product our query_tfidf vector with our text tfidf_matrix to obtain the cosine similarity scores.

Recall that the cosine similarity formula is the following:

cosine similarity score = (A . B) / ||A|| ||B||

Here, A = our query_tfidf vector, and B = each row of our tfidf_matrix

Note that: A . B = A * B^T, or A dot product B = A multiply by B Transpose.

Knowing the formula, let's manually compute our cosine similarity scores for query_tfidf, then compare our answers with the values provided by the sklearn.metrics cosine_similarity function:

import numpy as np

query_tfidf_arr = query_tfidf.toarray()
tfidf_matrix_arr = tfidf_matrix.toarray()

cosine_similarity_1 = np.dot(query_tfidf_arr, tfidf_matrix_arr[0].T) / (
    np.linalg.norm(query_tfidf_arr) * np.linalg.norm(tfidf_matrix_arr[0]))
cosine_similarity_2 = np.dot(query_tfidf_arr, tfidf_matrix_arr[1].T) / (
    np.linalg.norm(query_tfidf_arr) * np.linalg.norm(tfidf_matrix_arr[1]))

manual_cosine_similarities = [cosine_similarity_1[0], cosine_similarity_2[0]]
print(manual_cosine_similarities)    
#output
[1.0, 0.33609692727625745]

Our manually computed cosine similarity scores give values of [1.0, 0.33609692727625745]. Let's check our manually computed cosine similarity score with the answer value provided by the sklearn.metrics cosine_similarity function:

from sklearn.metrics.pairwise import cosine_similarity

function_cosine_similarities = cosine_similarity(query_tfidf, tfidf_matrix)
print(function_cosine_similarities)
#output
array([[1.0        , 0.33609693]])

The output values are the same! The manually computed cosine similarity values match the values computed by the cosine_similarity function.

Hence, this simple explanation shows how the cosine similarity values are computed. Hope you found this explanation helpful.

Solution 11 - Text

Given two vectors A and B in a 2D or 3D space, the cosine similarity is the cosine of the angle between those vectors.

If the angle is large (it can be at most 180 degrees, where cos 180° = -1), the vectors point away from each other; at the minimum angle of 0 degrees, cos 0° = 1, which means the vectors are aligned with each other and hence similar.

cos 90° = 0 means the vectors A and B are not similar at all. Since word counts cannot be negative, the angle between count vectors cannot exceed 90 degrees, so the cosine values lie between 0 and 1. Hence, a larger angle implies lower similarity (visualising it also makes sense).
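
A tiny sketch of my own illustrating those three cases:

import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine([1, 2], [2, 4]))    #  1.0 -> same direction (angle 0)
print(cosine([1, 0], [0, 1]))    #  0.0 -> orthogonal (angle 90)
print(cosine([1, 2], [-1, -2]))  # -1.0 -> opposite directions (angle 180)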

Solution 12 - Text

Here's simple Python code to calculate cosine similarity:

import math

def dot_prod(v1, v2):
    ret = 0
    for i in range(len(v1)):
        ret += v1[i] * v2[i]
    return ret

def magnitude(v):
    ret = 0
    for i in v:
        ret += i**2
    return math.sqrt(ret)

def cos_sim(v1, v2):
    return (dot_prod(v1, v2)) / (magnitude(v1) * magnitude(v2))
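
For example, applied to the word-count vectors from Solution 1 (my addition):

a = [2, 0, 1, 1, 0, 2, 1, 1]
b = [2, 1, 1, 0, 1, 1, 1, 1]
print(cos_sim(a, b))  # ~0.822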

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type          Original Author      Original Content on Stackoverflow
Question              TIMEX                View Question on Stackoverflow
Solution 1 - Text     Bill Bell            View Answer on Stackoverflow
Solution 2 - Text     mjv                  View Answer on Stackoverflow
Solution 3 - Text     tranmq               View Answer on Stackoverflow
Solution 4 - Text     userPS               View Answer on Stackoverflow
Solution 5 - Text     Victor Yan           View Answer on Stackoverflow
Solution 6 - Text     Andrew U             View Answer on Stackoverflow
Solution 7 - Text     Nilani Algiriyage    View Answer on Stackoverflow
Solution 8 - Text     user1472571          View Answer on Stackoverflow
Solution 9 - Text     nikoo28              View Answer on Stackoverflow
Solution 10 - Text    Yi Xiang Chong       View Answer on Stackoverflow
Solution 11 - Text    Drool                View Answer on Stackoverflow
Solution 12 - Text    Jake Morfe           View Answer on Stackoverflow