Figuring out how similar two strings are and then making that similarity a quantitative measurement is a basic problem in text analysis, text mining and natural language processing. There are a number of efficient methods to solve this problem. This survey looks at Python implementations of a simple but widely used method: Levenshtein distance as a measure of edit distance.
Edit distance between two strings is the minimum total number of operations that change one string into the other ,,. Edit distance is zero if two strings are identical. Edit distance is positive if two strings are not identical. Edit distance does not consider the length of the strings being compared: two strings that are close in length and are different by one character are treated the same as two strings where one is much longer than the other . All the implementations I tested are case-sensitive.
Generally, edit operations are inserting, deleting and substituting characters. Some methods support transposing characters. Fuzzy string matching libraries such as fuzzywuzzy are typically based on edit distance methods.
Levenshtein distance computes edit distance with insert, delete and substitution operations. This is the most parsimonious set of operations to transform one string into another. Levenshtein distance can be computed recursively , but efficient implementations use dynamic programming solutions with a table for holding computed costs ,.
Here are some illustrated simple edit distance operations. The first four examples demonstrate the operations on single character and empty strings:
Next, we have a simple sequence of replace operations that transform “ab” to “ba”:
Both of these examples have edit distance equal to 3 because it takes three operations to transform one word to another:
Damerau-Levenshtein extends the Levenshtein distance method with an additional operation: transpose, where two adjacent characters can be swapped. Damerau-Levenshtein may compute smaller edit distances if adjacent transpose operations can replace multiple replace operations. Efficient implementations are also based on dynamic programming .
Taking another look at the similarity of “ab” and “ba”, where the Levenshtein edit distance is 2:
Text similarity calculated using Damerau-Levenshtein returns edit distance equal to 1 because “ab” is changed into “ba” with one transpose operation instead of two replace operations:
The edit distance between “irks” and “risk” is larger with Levenshtein distance than Damerau-Levenshtein distance. Another way to think about this is “irks” and “risk” are more similar if using Damerau-Levenshtein than if using Levenshtein.
Fuzzywuzzy‘s optimized similarity functions are implemented using the python-Levenshtein module. The actual Levenshtein edit distance code is written in C for faster performance. Fuzzywuzzy can fall back to a slower pure Python implementation if python-Levenshtein is not available.
The pyxDamerauLevenshtein module provides Damerau-Levenshtein as a dynamic programming algorithm written in C.
The StringDist module provides both Levenshtein and Damerau-Levenshtein. The algorithms are also implemented in C and falls back to Python if the C implementation can’t be used.
The jellyfish module includes an extensive list of string matching algorithms, including both Levenshtein and Damerau-Levenshtein, all implemented in both C and Python with the choice of using either.
I used longer strings than in the illustrated examples to benchmark the performance and memory usage of these Python tools. The first test computes the edit distances between “this was a test” and “this is a coast”. The second test is the Amelia Earhart quote “the most difficult thing is the decision to act, the rest is merely tenacity” compared with this version mutated with deliberately misspelled words: “teh most difficult thing is the decsion to act, the reast is merely teancty”. The Levenshtein and Damerau-Levenshtein edit distances are equal in the first test. The Damerau-Levenshtein edit distance is smaller than the Levenshtein edit distance in the second test.
Memory usage is consistent for both examples and all tools (approximately 57-58 MiB). There is a lot more variation in performance between the tools: python-Levenshtein was very fast, StringDist and jellyfish also computed edit distances efficiently and were the fastest Damerau-Levenshtein implementations. StringDist performed slightly better in the first test and jellyfish performed better in the second test.