FUZZY MATCHING ALGORITHMS EXPLAINED
What is Fuzzy Matching?
Fuzzy matching is a technique of finding strings that match other strings approximately, rather than exactly. The problem of fuzzy matching is typically divided into two sub-problems:
Finding approximate substring matches inside a given text entry.
Finding dictionary text entries that approximately match a specific pattern.
Fuzzy matching goes by different names including Fuzzy String Matching, Approximate String Matching and Fuzzy String Searching. Most fuzzy matching algorithms produce percentages that can help us gauge how similar the compared text entries are, with a typical scale ranging from 0% (no similarity) to 100% (exact match).
Fuzzy Matching Algorithms
As indicated above, there are different approaches to fuzzy matching and each of them is expressed in a different fuzzy matching algorithm. Here is a brief look at the most popular fuzzy matching algorithms and how they work:
Levenshtein distance: The minimum number of single-character edits that required to transform one word into another. Valid edits are insertions, deletions or substitutions.
Damerau–Levenshtein distance: Similar to Levenshtein distance, this is the minimum number of operations needed to transform one string into the other. Valid operations are insertions, deletions, substitutions or transpositions of adjacent characters.
N-Gram: This is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application.
Soundex: Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be compared and matched, despite minor differences in spelling.
Cosine Similarity: This metric is got by converting strings into vectors by counting characters appearing in both strings and then calculating the dot product of the two vectors.
Peregrine: Next, we have our own fuzzy matching algorithm. It works by counting substrings appearing in both text entries and then minimising a probabilistic function to eliminate incidences of false positives. The resulting percentage similarity is, therefore, calculated by only considering distinct substrings.
Your Brain: "Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoatnt tihng is taht the frist and Isat Itteers be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe."
Why Use Fuzzy Matching Software?
While data cleaning, fuzzy matching can improve your chances of success by reducing your error rate, while simultaneously increasing your efficiency and productivity.
Real-world data rarely stored in a standardised format because of the different ways of collecting and processing the same. This leads to differences in spelling, formatting and other common data issues. Despite these complexities, one can make data cleaning easier by making use of fuzzy matching software to reduce the work-hours that are otherwise wasted in the process.
Fuzzy matching algorithms have been successfully used in the areas of spell checking, spam filtering and record linkage. To find out how to make use of these and other advantages for your business, please click here.