Similarity Distances for Natural Language Processing

Flavien Vidal
8 min read · Apr 28, 2021


source: _dChris via flickr (CC BY 2.0)

The importance of having a good text representation:

Feature engineering and selection is surely one of the most crucial steps of any machine learning project. No matter which algorithm you choose, if the features you give it are bad, the results you get will be bad too. This is well summarized by the expression: “garbage in, garbage out”. Feature engineering is only optimal when we have a good knowledge of the problem, and it strongly depends on the data in question. However, some recurrent techniques are widely used in practice and can be helpful in many different problems. In this article, we focus on the case of textual data and address the question: how do we handle feature engineering with textual data?

Vector Space Model
In order for a computer to work with text data, it should be converted into some mathematical form. Therefore, we usually represent text units (characters, words, sentences, paragraphs, documents, …) with vectors.
The idea is to represent these text units as vectors of numbers that will enable many operations such as information retrieval, document clustering, language translation or text summarization. Thus, the vector space model we choose for a given problem should enable us to easily compare documents by measuring the distance between their corresponding vectors, that is to say their similarity. There are multiple ways to compute vectors that capture the structure or semantics of text units.
Before diving into the different methods of text representation, it is necessary to start by discussing some techniques for measuring similarity between text units.

Quantifying similarity between text units:

There is a wide variety of similarity metrics that can be used to quantify the similarity between text units, but they do not all share the same definition of what “similar” means. Depending on the problem we face and on the type of resemblance we want to capture between units, we may choose one metric over another.

Some of the most common ways to capture similarity between text units are:
- Longest Common Substring (LCS),
- Levenshtein Edit Distance,
- Hamming Distance,
- Cosine Similarity,
- Jaccard Distance,
- Euclidean Distance.

  • Longest Common Substring (LCS):

The LCS is a common example of a character-based similarity measure. Given two strings, s1 of length n1 and s2 of length n2, it simply considers the length of the longest string (or strings) that is a substring of both s1 and s2.

Applications: data deduplication and plagiarism detection.

Implementation Proposition of Longest Common Substring
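A minimal Python sketch of the classic dynamic-programming approach, where each cell of the table stores the length of the longest common suffix ending at that pair of positions (function and variable names are illustrative):

```python
def longest_common_substring(s1: str, s2: str) -> int:
    """Length of the longest string that is a substring of both s1 and s2."""
    n1, n2 = len(s1), len(s2)
    # dp[i][j] = length of the longest common suffix of s1[:i] and s2[:j]
    dp = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    best = 0
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                best = max(best, dp[i][j])
    return best


print(longest_common_substring("Lydia", "Lydoko"))  # 3
```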

Example: the LCS of the strings ‘Lydia’ and ‘Lydoko’ has length 3 (the common substring ‘Lyd’).

  • Levenshtein Edit Distance (V. Levenshtein, 1965):

The Levenshtein distance is another common example of a character-based similarity measure. It quantifies how dissimilar two text units are by computing the minimum number of single-character edits (substitution, deletion and insertion operations) required to convert one text unit into the other. Mathematically, it can be written as:
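$$\mathrm{lev}(a, b) = \begin{cases} |a| & \text{if } |b| = 0 \\ |b| & \text{if } |a| = 0 \\ \mathrm{lev}\big(\mathrm{tail}(a), \mathrm{tail}(b)\big) & \text{if } a[0] = b[0] \\ 1 + \min \begin{cases} \mathrm{lev}\big(\mathrm{tail}(a), b\big) \\ \mathrm{lev}\big(a, \mathrm{tail}(b)\big) \\ \mathrm{lev}\big(\mathrm{tail}(a), \mathrm{tail}(b)\big) \end{cases} & \text{otherwise} \end{cases}$$

where tail(x) denotes the string x without its first character.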

Properties:
- It is a true string metric: non-negative, zero if and only if the two strings are equal, symmetric, and it satisfies the triangle inequality.
- It is bounded below by the difference of the lengths of the two strings, and above by the length of the longer string.

Applications:
- in approximate string matching where the objective is to find matches for short strings in longer texts (spelling correction, OCR correction systems, …),
- in bioinformatics to quantify the similarity of DNA sequences (which can be viewed as strings of letters A, C, G and T),
- and it can also be used in music to measure the similarity of melodies.

Implementation Proposition of Levenshtein Distance
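A minimal Python sketch of the Wagner–Fischer dynamic program, keeping only two rows of the table in memory:

```python
def levenshtein(s1: str, s2: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn s1 into s2."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # iterate over the longer string
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


print(levenshtein("hubert", "uber"))  # 2
```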

Example: changing ‘hubert’ into ‘uber’ needs only 2 operations (deleting ‘h’ and ‘t’), thus lev(‘hubert’, ‘uber’) = 2.

NB: different definitions of edit distance use different sets of operations. The longest common subsequence distance allows only insertion and deletion, not substitution. The Hamming distance allows only substitution and therefore applies only to strings of the same length. The Damerau–Levenshtein distance allows insertion, deletion, substitution and transposition of two adjacent characters. The Jaro distance allows only transposition.

  • Hamming Distance (R. Hamming, 1950):

Another character-based similarity measure is the Hamming distance. The Hamming distance between two strings of equal length is the minimum number of substitutions required to change one string into the other, i.e. the number of positions at which they differ. Mathematically, this can be written as follows:
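$$d_H(a, b) = \sum_{i=1}^{n} \mathbf{1}\,[a_i \neq b_i]$$

where n is the common length of a and b, and 1[·] is the indicator function.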

Properties:
- On strings of a fixed length n, the Hamming distance is a metric: non-negative, zero if and only if the strings are equal, symmetric, and it satisfies the triangle inequality.
- It is bounded above by n.

Application: mainly used in coding theory for error detection and correction (note that the following representation also shows that the Hamming distance between binary strings is equivalent to the Manhattan distance between the corresponding vertices of a hypercube).

3-bit binary cube (left) and 4-bit tesseract (right) for finding Hamming distances
Implementation Proposition of Hamming Distance
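A minimal Python sketch, counting mismatched positions directly:

```python
def hamming(s1: str, s2: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


print(hamming("Lydia", "Media"))  # 2
```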

Example: the Hamming distance between strings ‘Lydia’ and ‘Media’ is 2.

  • Cosine Similarity:

The cosine similarity measures the proximity between two non-zero vectors of an inner product (pre-Hilbert) space. The cosine similarity of two text units is simply the cosine of the angle formed by the two vectors representing them, i.e. the Euclidean inner product of the normalized vectors. When it is close to 1, the two units are close in the chosen vector space; when it is close to -1, the two units are far apart.
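Formally, for two vectors u and v:

$$\cos(\theta) = \frac{u \cdot v}{\|u\| \, \|v\|} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2} \, \sqrt{\sum_{i=1}^{n} v_i^2}}$$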

Therefore, this metric captures orientation only, not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° to each other have a similarity of 0, and two diametrically opposed vectors have a similarity of -1.

Example of cosine similarities in a 2-dimensional space
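A minimal Python sketch (the sample vectors are arbitrary):

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Two count vectors pointing in roughly the same direction score close to 1.
print(cosine_similarity([1, 2, 0], [2, 4, 1]))  # ≈ 0.98
```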
  • Jaccard Distance:

The Jaccard distance measures how dissimilar two multisets are: the lower the distance, the more similar the two multisets. It is computed from the Jaccard index (or Jaccard similarity coefficient), which is the ratio of the cardinality of the intersection of the multisets to the cardinality of their union. The distance is then obtained by subtracting the index from 1. Mathematically, it can be written as:
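$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad d_J(A, B) = 1 - J(A, B)$$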

These expressions are undefined if both A and B are empty sets, in which case we define the index and distance to be 1 and 0 respectively.

The Jaccard distance works quite well in practice, especially for sparse data. For example, a streaming service like Netflix could represent customers as multisets of movies watched, and then use the Jaccard distance to measure the similarity between two customers, i.e. how close their tastes are. Then, based on the preferences of two users and their similarity, we could make recommendations to one or the other.
Similarly, if we represent documents as the multisets of words they contain, then the Jaccard distance between two documents is often a reasonable measure of their similarity. In this case we would represent the multisets A and B as vectors va and vb, with the i-th entry of va equal to the number of times the i-th element appears in A:
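$$J(A, B) = \frac{\sum_i \min\big((v_a)_i, (v_b)_i\big)}{\sum_i \max\big((v_a)_i, (v_b)_i\big)}$$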

The Jaccard distance used to compare text units represented as bags of words (BoW) typically presents some flaws: as documents grow, the number of common words tends to increase even if the documents discuss different topics. Moreover, this metric cannot capture the similarity between text units that have the same meaning but are written differently (this is really an issue of text representation, but since the Jaccard distance is particularly well suited to the BoW strategy, it remains a concern). For example, these two text units have the same meaning, but their Jaccard distance will be close to 1 since they share almost no words:
Text unit 1: President greets the press in Chicago
Text unit 2: Obama speaks in Illinois
Another concern is the order of words in a sentence:
Text unit 1: The dog bites the man
Text unit 2: The man bites the dog
Although these two units have totally different meanings, the Jaccard distance considers them maximally similar (distance 0), since they contain exactly the same words.
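A minimal Python sketch on the word sets above (set-based rather than multiset-based, for brevity):

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return 1 - len(a & b) / len(a | b)


unit1 = set("President greets the press in Chicago".lower().split())
unit2 = set("Obama speaks in Illinois".lower().split())
print(jaccard_distance(unit1, unit2))  # ≈ 0.89: only "in" is shared

unit3 = set("The dog bites the man".lower().split())
unit4 = set("The man bites the dog".lower().split())
print(jaccard_distance(unit3, unit4))  # 0.0: same words, opposite meaning
```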

  • Euclidean Distance:

The Euclidean distance is a token-based similarity measure. Given two points in Rn, the Euclidean distance is simply the length of the line segment between them. It is also often referred to as the l2-norm, the l2-distance or the Pythagorean metric, and can be expressed as:
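$$d(u, v) = \|u - v\|_2 = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}$$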

Properties:
The Euclidean distance is symmetric, non-negative (zero only between identical points) and obeys the triangle inequality.

One of the important properties of this norm, relative to other norms, is that it remains unchanged under arbitrary rotations of space around the origin.

Applications:
- in Euclidean geometry, to find the shortest distance between two points in a Euclidean space,
- in clustering algorithms such as K-means,
- in statistics to measure the similarity between data points, or in methods such as least squares (where the squared Euclidean distance is used because it yields a convex, differentiable objective)

More generally, the Minkowski (or lp) distance generalizes the Euclidean distance:
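$$d_p(u, v) = \left( \sum_{i=1}^{n} |u_i - v_i|^p \right)^{1/p}$$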

It equals the Manhattan distance when p = 1, the Euclidean distance when p = 2, and tends to the Chebyshev distance (the maximum or L∞ metric) as p approaches infinity.
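A minimal Python sketch of the lp distance (the sample points are arbitrary):

```python
def minkowski(u: list[float], v: list[float], p: float = 2.0) -> float:
    """lp distance between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)


u, v = [0.0, 0.0], [3.0, 4.0]
print(minkowski(u, v, p=1))  # 7.0 (Manhattan)
print(minkowski(u, v, p=2))  # 5.0 (Euclidean)
print(max(abs(a - b) for a, b in zip(u, v)))  # 4.0 (Chebyshev, the p → ∞ limit)
```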

Conclusion

Now that you have an overview of some of the most common distance metrics used in NLP, you can explore further metrics or move on to text vectorization to build on this knowledge.
I hope you enjoyed this article! Please feel free to contact me if you have any questions or if you feel that additional explanations should be added to this article.

Thanks for reading!
*All images (including formulas) are by the author except where stated otherwise
