
Python text analysis tools: FuzzyWuzzy’s basic string matching

Published March 29, 2020 in data

The Python FuzzyWuzzy module uses Levenshtein edit distance to implement fuzzy string matching. FuzzyWuzzy’s matching tools return scores on a scale from 0 to 100, where 100 means the two strings are identical. The simplest matching tool FuzzyWuzzy offers is the ratio(..) function.
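Here is a minimal sketch of what ratio computes, using only the standard library. It assumes the pure-Python path, where FuzzyWuzzy scores strings with difflib's SequenceMatcher; the ratio helper below is a local stand-in, not the library itself, and scores may differ slightly when python-Levenshtein is installed. With FuzzyWuzzy installed, the equivalent call is fuzz.ratio(s1, s2) after `from fuzzywuzzy import fuzz`.

```python
from difflib import SequenceMatcher


def ratio(s1: str, s2: str) -> int:
    """Stand-in for fuzz.ratio: similarity of two strings scaled to 0..100."""
    # SequenceMatcher.ratio() returns 2*M/T, where M is the number of
    # matching characters and T is the combined length of both strings.
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


print(ratio("fuzzy matching", "fuzzy matching"))  # identical strings score 100
print(ratio("fuzzy matching", "fuzzy matchin"))   # one missing character
print(ratio("fuzzy matching", "xxx"))             # nothing in common
```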

The basic ratio function works well for simple string matching. However, if you’re trying to fuzzy match a single word or a short string against a much longer string, ratio tends not to return useful match scores.
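To see why, here is a small sketch with made-up sample strings. It reuses a difflib-based stand-in for ratio (an assumption, as is the sample text), and it also prints the ceiling on the score: because the score is 2*M/T and M cannot exceed the length of the shorter string, a short string can never score high against a long one, even when it appears in it verbatim.

```python
from difflib import SequenceMatcher


def ratio(s1: str, s2: str) -> int:
    """Stand-in for fuzz.ratio, based on difflib (0..100)."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


# Hypothetical sample data: the short string is an exact substring.
short = "Lincoln"
long_text = "Abraham Lincoln was the 16th president of the United States"

score = ratio(short, long_text)

# Best possible score: every character of the shorter string matches.
ceiling = 100 * 2 * len(short) / (len(short) + len(long_text))

print("ratio score:", score)
print("highest score possible for these lengths:", round(ceiling, 1))
```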

Scores in cases like these are very poor, considering that the shorter string is a substring of the longer string, or a very close match to part of it. This is where FuzzyWuzzy’s partial_ratio(..) tool is much more useful.

The partial_ratio function searches the longer string for the substring most similar to the shorter string. The algorithm uses a SequenceMatcher (difflib’s, or a python-Levenshtein-backed equivalent when that package is installed) to find blocks of matching characters. It then lines the shorter string up with the longer string at each block and compares it with the substring of the same length (from the v0.18.0 code comments: "the best partial match will block align with at least one of those blocks"). Each aligned comparison produces a score from 0 to 100, and the score from the best match is returned.
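The block-alignment idea can be sketched in a few lines. This is a simplified reading of the approach described above, written against difflib only; the partial_ratio helper is a local stand-in rather than the library's code, and the sample strings are made up. With FuzzyWuzzy installed, the real call is fuzz.partial_ratio(s1, s2).

```python
from difflib import SequenceMatcher


def partial_ratio(s1: str, s2: str) -> int:
    """Sketch of block-aligned partial matching, scaled to 0..100."""
    # sorted() is stable, so s1 counts as the shorter string on a tie.
    shorter, longer = sorted((s1, s2), key=len)

    # Each block is (index in shorter, index in longer, length).
    blocks = SequenceMatcher(None, shorter, longer).get_matching_blocks()

    scores = []
    for block in blocks:
        # Slide the shorter string so this block lines up in both strings.
        start = max(block.b - block.a, 0)
        window = longer[start:start + len(shorter)]
        scores.append(SequenceMatcher(None, shorter, window).ratio())

    return int(round(100 * max(scores)))


long_text = "Abraham Lincoln was the 16th president of the United States"
print(partial_ratio("Lincoln", long_text))    # exact substring: 100
print(partial_ratio("abcd", "XXXbcdeEEE"))    # best window is "Xbcd": 75
```

The second call mirrors the example in the library's code comments: the matching block "bcd" aligns "abcd" with the window "Xbcd", and three of four characters match.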

When matching simple short strings, ratio and partial_ratio tend to return comparable results, and whether a pair counts as a match depends on how high you set your match threshold.
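A quick way to check this on your own data is to score a few pairs with both functions and apply a threshold. The sketch below uses difflib-based stand-ins for both scorers; the word pairs and the threshold of 75 are arbitrary choices for illustration, not recommendations.

```python
from difflib import SequenceMatcher


def ratio(s1: str, s2: str) -> int:
    """Stand-in for fuzz.ratio (0..100)."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def partial_ratio(s1: str, s2: str) -> int:
    """Stand-in for fuzz.partial_ratio: best block-aligned window (0..100)."""
    shorter, longer = sorted((s1, s2), key=len)
    blocks = SequenceMatcher(None, shorter, longer).get_matching_blocks()
    best = 0.0
    for block in blocks:
        start = max(block.b - block.a, 0)
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))


THRESHOLD = 75  # arbitrary cut-off for this illustration

pairs = [("color", "colour"), ("apple", "appel"), ("cat", "dog")]
for a, b in pairs:
    r, p = ratio(a, b), partial_ratio(a, b)
    print(f"{a!r} vs {b!r}: ratio={r} ({r >= THRESHOLD}), "
          f"partial_ratio={p} ({p >= THRESHOLD})")
```

Raising or lowering THRESHOLD shows how quickly borderline pairs flip between match and non-match under each scorer.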
