FUZZY MATCHING ALGORITHMS EXPLAINED
What is Fuzzy Matching?
Fuzzy matching is a technique of finding strings in a dataset, that approximately match strings in a separate dataset, rather than exactly. The discipline of fuzzy matching can be typically divided into two sub-problems:
Finding approximate substring matches inside a given text entry.
Finding dictionary text entries that approximately match a specific pattern.
Fuzzy matching is known by several names including Fuzzy String Matching and Approximate String Matching. Most fuzzy matching algorithms return similarity scores as percentages to help users gauge how similar the compared text entries are, with a typical scale ranging from 0% (no similarity) to 100% (exact match).
Why Use Fuzzy Matching Software?
Real-world data is very rarely stored in standardised formats because of the different methods of collecting and processing the same.
This usually leads to differences in spelling, formatting and other common data entry inconsistencies. Despite these text-based differences, the entire process of data cleaning can be vastly improved by making use of fuzzy matching software.
Fuzzy matching algorithms have been successfully applied in areas like nucleotide sequence matching, spell checking, spam filtering and record linkage. To find out how to make use of these and other advantages for your business, please check out this article by Data Ladder.
Fuzzy Matching in Action: A Real-World Example
Record linkage techniques can be used to detect fraud, resource wastage or abuse. In this example, two databases were merged and compared for inconsistencies, leading to a discovery that helped the U.S. government put a stop to fraudulent behaviour by some government employees:
In a period of 18 months leading to the summer of 2005, a database comprising records of 40,000 pilots licensed by the U.S. Federal Aviation Administration and residing in Northern California, was matched to a database consisting of individuals receiving disability payments from the Social Security Administration, and it was discovered that names of some pilots appeared in both databases.
According to a report by the Associated Press, a prosecutor from the U.S. Attorney’s Office in Fresno stated the following: There was probably criminal wrongdoing. The pilots were either lying to the FAA or wrongfully receiving benefits. The pilots claimed to be medically fit to fly airplanes. However, they may have been flying with debilitating illnesses that should have kept them grounded, ranging from schizophrenia and bipolar disorder to drug and alcohol addiction and heart conditions.
In the end, at least 40 pilots were charged with the crimes of "making false statements to a government agency" and "making and delivering a false official writing".
The FAA also suspended licenses of 14 pilots in total, while others were put on notice pending further investigations.
Fuzzy Matching Algorithms
Here is a brief look at the most popular fuzzy matching algorithms and how they work:
Cosine Similarity: This is a metric used to measure the similarity between two text entries by representing them as vectors in an n-dimensional vector space. The cosine of the angle between these two vectors is calculated, with a score ranging from 0 to 1.
Levenshtein Distance: This counts the minimum number of single-character edits that are required to transform one word into another. Valid edits are insertions, deletions or substitutions.
Damerau–Levenshtein Distance: This counts the minimum number of edits required to transform one word into the other. Valid edits are insertions, deletions, substitutions or transpositions of adjacent characters.
Peregrine: This is our own fuzzy matching algorithm and it is developed inhouse. It works by calculating the percentage similarity between the unique substrings in any two text entries.
N-Gram: This is a contiguous sequence of n items from a given text entry. These items can be phonemes, syllables, letters, words or base pairs according to the application.
Soundex: This is an algorithm for indexing words by sound, as pronounced in English. The goal is for similar sounding words to be encoded to the same representation so that they can be compared or matched, despite minor differences in spelling. Flookup uses a refined version of Soundex for matching text by sound similarity.