FUZZY MATCHING ALGORITHMS EXPLAINED

What is Fuzzy Matching?

Fuzzy matching is a technique of finding strings that match other strings approximately, rather than exactly. The problem of fuzzy matching is typically divided into two sub-problems:

  • Finding approximate substring matches inside a given text entry.

  • Finding dictionary text entries that approximately match a specific pattern.

Fuzzy matching is known by several names including Fuzzy String Matching, Approximate String Matching and Fuzzy String Searching. Most fuzzy matching algorithms produce percentages that can help users gauge how similar the compared text entries are, with a typical scale ranging from 0% [no similarity] to 100% [exact match].

Why Use Fuzzy Matching Software?

Real-world data is rarely stored in a standardised format because of the different methods of collecting and processing the same. This leads to differences in spelling, formatting and other common data entry issues. Despite these complexities, the process of data cleaning, especially efficiency, can be vastly improved by making use of fuzzy matching software.

Fuzzy matching algorithms have been successfully applied in areas like spell checking, spam filtering and record linkage. To find out how to make use of these and other advantages for your business, please click here.

Fuzzy Matching Algorithms

As indicated above, there are different approaches to fuzzy matching and each of them is expressed in a different fuzzy matching algorithm. Here is a brief look at the most popular fuzzy matching algorithms and how they work:

  • Levenshtein distance: This is the minimum number of single-character edits that are required to transform one word into another. Valid edits are insertions, deletions or substitutions.

  • Damerau–Levenshtein distance: This is the minimum number of operations needed to transform one string into the other. Valid operations are insertions, deletions, substitutions or transpositions of adjacent characters.

  • Cosine Similarity: This is a metric that is got by converting strings into vectors by counting characters appearing in both strings and then calculating the dot product of the two vectors.

  • N-Gram: This is a contiguous sequence of n items from a given sequence of text. The items can be phonemes, syllables, letters, words or base pairs according to the application.

  • Soundex: This is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for similar sounding words to be encoded to the same representation so that they can be compared and matched, despite minor differences in spelling. Flookup uses this for matching text by sound similarity.

  • Peregrine: This is our main fuzzy matching algorithm, and it is developed inhouse. It works by counting substrings appearing in both text entries and then applies minimising a function to eliminate incidences of false positives. The resulting percentage similarity is, therefore, calculated by only considering unique substrings.

  • Your Brain: "Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoatnt tihng is taht the frist and Isat Itteers be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe."

Fuzzy Matching in Action: A Real-World Example

Record linkage techniques can be used to detect fraud, resource wastage or abuse of government programs. In this example, two databases were merged and compared for inconsistencies, leading to a discovery that helped the U.S. government put a stop to fraudulent behaviour by some government employees:

In a period of 18 months leading to the summer of 2005, a database consisting of records on 40,000 aeroplane pilots licensed by the U.S. Federal Aviation Administration and residing in Northern California, was matched to a database consisting of individuals receiving disability payments from the Social Security Administration, and it was discovered that the names of some pilots appeared in both databases.

A prosecutor in the U.S. Attorney’s Office in Fresno, California stated the following, according to an AP report: “There was probably criminal wrongdoing. The pilots were either lying to the FAA or wrongfully receiving benefits. The pilots claimed to be medically fit to fly airplanes. However, they may have been flying with debilitating illnesses that should have kept them grounded, ranging from schizophrenia and bipolar disorder to drug and alcohol addiction and heart conditions.”

In the end, 40 pilots were charged with the crimes of "making false statements to a government agency" and "making and delivering a false official writing". The FAA also suspended licenses of 14 pilots in total, while others were put on notice pending further investigations.