FUZZY MATCHING ALGORITHMS EXPLAINED
What is Fuzzy Matching?
Fuzzy matching is a technique for finding strings in one dataset that approximately, rather than exactly, match strings in another dataset. The discipline of fuzzy matching is typically subdivided into two problems:
Finding approximate substring matches inside any given text entry.
Finding dictionary text entries that approximately match a specific pattern.
Fuzzy matching is known by several names, including fuzzy string matching and approximate string matching. Most fuzzy matching algorithms return similarity scores as percentages to help users gauge how similar the compared text entries are, with a typical scale ranging from 0% for no similarity to 100% for an exact match.
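As a rough illustration of percentage scoring and of the second problem (dictionary lookup), here is a minimal sketch that uses Python's standard difflib module. It is a stand-in, not the algorithm used by any particular fuzzy matching product; the function name similarity_percent, the sample dictionary and the 0.6 cutoff are assumptions made for the example.

```python
from difflib import SequenceMatcher, get_close_matches


def similarity_percent(a: str, b: str) -> float:
    """Return a 0-100 similarity score for two strings, ignoring case."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Dictionary lookup: find entries that approximately match a pattern.
dictionary = ["apple", "apply", "maple", "banana"]  # made-up entries
candidates = get_close_matches("appel", dictionary, n=3, cutoff=0.6)

print(similarity_percent("Fuzzy", "fuzzy"))  # identical once lowercased
print(similarity_percent("abc", "xyz"))      # no characters in common
print(candidates)                            # entries scoring at least 0.6
```

The score produced by SequenceMatcher is only one of many possible similarity measures, so percentages from different tools are not directly comparable.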
Why Use Fuzzy Matching Software?
Data in the real world is often not stored in uniform formats due to the variety of methods used in data collection and processing. This diversity can lead to discrepancies in data entry, such as variations in spelling and formatting. However, these challenges can be significantly mitigated with the use of fuzzy matching software during the data cleaning process.
Fuzzy matching software can aid in identifying and rectifying text-based discrepancies within your datasets. This feature is especially beneficial when dealing with non-standardised data, reducing the number of manual data cleaning operations.
A well-designed fuzzy matching tool eliminates the need for costly and time-consuming tasks such as writing new code or developing algorithms from scratch. This allows business users and technical teams to focus their efforts on addressing data processing challenges, rather than being burdened with additional tasks. The use of such a tool not only improves efficiency but also optimises resource allocation.
What Can Fuzzy Matching Software Do?
Fuzzy matching algorithms have been successfully applied in areas like spell checking and record linkage. Here is a brief look at what fuzzy matching software can be used for:
Record linkage: Fuzzy matching software can seamlessly link closely or loosely related records across multiple data sources. This creates a unified identity, providing a holistic view of each entity, be it a customer, product or any other subject of interest.
Data deduplication: This software can efficiently merge duplicate records within extensive datasets. This not only reduces redundancy but also improves the accuracy of data analysis and insights.
Spelling variation analysis: Fuzzy matching software is adept at detecting and correcting spelling errors, typos, or variations in customer data. This ensures precise search and analysis, enhancing the quality of customer interactions and engagements.
Data standardisation: This software can link records with abbreviations and acronyms. For example, it can match "Limited" with "Ltd", ensuring a uniform format across the dataset. This standardisation facilitates easier data management and more accurate analytics.
Data integration: Fuzzy matching software can consolidate data from diverse sources into a single on-premises platform. This allows for straightforward data sanitation, ensuring that your data is clean, consistent, and ready for analysis.
Name variation matching: It can manage variations in names, titles, or prefixes. This ensures accurate customer profiling and personalised communication, enhancing customer experience and satisfaction.
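Two of the uses listed above, data standardisation and data deduplication, can be sketched together in a few lines of Python. This is a simplified illustration under stated assumptions, not how any specific product works: the ABBREVIATIONS map, the standardise and deduplicate helpers, the sample records and the 0.9 threshold are all hypothetical, and difflib's SequenceMatcher stands in for a dedicated fuzzy matching algorithm.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation map; a real one would be built for your data.
ABBREVIATIONS = {"limited": "ltd", "incorporated": "inc", "company": "co"}


def standardise(name: str) -> str:
    """Lowercase, replace punctuation with spaces and shorten known words."""
    words = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def deduplicate(names: list[str], threshold: float = 0.9) -> list[list[str]]:
    """Greedily group names whose standardised forms are similar enough."""
    groups: list[tuple[str, list[str]]] = []  # (standardised key, members)
    for name in names:
        key = standardise(name)
        for group_key, members in groups:
            if SequenceMatcher(None, key, group_key).ratio() >= threshold:
                members.append(name)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append((key, [name]))
    return [members for _, members in groups]


records = ["Acme Limited", "ACME Ltd.", "Acme Ltd", "Globex Company"]
print(deduplicate(records))
```

Standardising before comparing means "Limited" and "Ltd" no longer count as a difference, so the similarity threshold can stay strict without splitting obvious duplicates into separate groups.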
Minimising the Impact of False Positives
Set a fuzzy match threshold: Establish a fuzzy match threshold for your particular dataset, that is, a similarity score below which two entries will not be considered a match. Values that are too low will increase the likelihood of false positives, while values that are too high increase the likelihood of false negatives.
Refine your lookup criteria: Do not just rely on one data point for matching. Consider including other factors like addresses and social security numbers for a more robust fuzzy matching operation.
Quality over quantity: Make sure your main dataset is clean, comprehensive and current. Compromised datasets will always lead to corrupted match results.
Expert review: Have a domain expert review the results of the match operation. An expert, with their in-depth knowledge of your data, can be instrumental in developing and fine-tuning the data-matching algorithm, as well as reviewing the results. For instance, if you are matching a school database, consulting someone who understands why certain information might be missing or unrecorded could be beneficial.
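The first two recommendations above, setting a threshold and matching on more than one data point, can be sketched as follows. This is an illustrative example only: the record fields, the 0.6 and 0.4 weights, the default threshold of 0.85 and the helper names score and is_match are assumptions, and difflib's SequenceMatcher stands in for whichever similarity measure you actually use.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> float:
    """Case-insensitive similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Match only if the weighted name and address score clears the threshold."""
    combined = (0.6 * score(rec_a["name"], rec_b["name"])
                + 0.4 * score(rec_a["address"], rec_b["address"]))
    return combined >= threshold


a = {"name": "Jon Smith", "address": "12 High Street"}
b = {"name": "John Smith", "address": "12 High St"}
c = {"name": "John Smith", "address": "98 Lake Road"}

print(is_match(a, b))                  # similar name and similar address
print(is_match(a, c))                  # similar name, unrelated address
print(is_match(a, b, threshold=0.95))  # stricter threshold
```

Records a and c would pass on name similarity alone, which is the kind of false positive a second field helps to rule out. Raising the threshold to 0.95 rejects even the plausible pair a and b, showing how an overly strict value produces false negatives.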
Fuzzy Matching in Action: A Real-World Example
Record linkage techniques can be used to detect fraud, resource wastage or abuse. In this story, two databases were merged and compared for inconsistencies, leading to a discovery that helped the U.S. government put a stop to fraudulent behaviour by some government employees:
Over the 18 months leading up to the summer of 2005, a database of 40,000 pilots licensed by the U.S. Federal Aviation Administration and residing in Northern California was matched against a database of individuals receiving disability payments from the Social Security Administration. The names of some pilots were found in both databases.
In a report by the Associated Press, a prosecutor from the U.S. Attorney’s Office in Fresno, CA stated the following:
There was probably criminal wrongdoing. The pilots were either lying to the FAA or wrongfully receiving benefits. The pilots claimed to be medically fit to fly airplanes. However, they may have been flying with debilitating illnesses that should have kept them grounded, ranging from schizophrenia and bipolar disorder to drug and alcohol addiction and heart conditions.
In the end, at least 40 pilots were charged with the crimes of "making false statements to a government agency" and "making and delivering a false official writing". The FAA also suspended licenses of 14 pilots in total, while others were put on notice pending further investigations.
Popular Fuzzy Matching Algorithms
Cosine Similarity: It measures the similarity between two strings by representing them as vectors, for example of character or word counts, in an n-dimensional vector space. The cosine of the angle between these two vectors is calculated, giving a score that ranges from 0 to 1 for such non-negative vectors.
Levenshtein Distance: It calculates the minimum number of single-character edits that are required to transform one word into another. Valid edits are insertions, deletions or substitutions.
Peregrine: It is our own fuzzy matching algorithm and it was developed by Andrew Apell. It calculates the percentage similarity between the unique substrings contained in any two text entries.
Damerau–Levenshtein Distance: It calculates the minimum number of edits that are required to transform one word into the other. Valid edits are insertions, deletions, substitutions or transpositions of adjacent characters.
Soundex: This algorithm indexes words by sound, as pronounced in English. The goal is for similar sounding words to be encoded to the same representation so that they can be compared, despite minor differences in spelling. Flookup uses a refined version of Soundex for matching text by sound similarity.
n-gram: It is a contiguous sequence of n items from any given text entry. It can be a sequence of syllables, letters, phonemes, words or base pairs according to the application.
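Several of the algorithms above are compact enough to sketch directly. The following Python is a minimal, unoptimised illustration of Levenshtein distance, the restricted form of Damerau-Levenshtein distance (often called optimal string alignment), character n-grams and cosine similarity over n-gram counts. The function names and the choice of character bigrams are assumptions made for the example; Peregrine and the refined Soundex variant mentioned above are not reproduced here because their details are not given.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein distance (adjacent transpositions)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = int(a[i - 1] != b[j - 1])
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def ngrams(text: str, n: int = 2) -> list[str]:
    """Contiguous character sequences of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def cosine_similarity(a: str, b: str, n: int = 2) -> float:
    """Cosine of the angle between the n-gram count vectors of a and b."""
    va, vb = Counter(ngrams(a, n)), Counter(ngrams(b, n))
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


print(levenshtein("kitten", "sitting"))
print(levenshtein("ca", "ac"), osa_distance("ca", "ac"))
print(ngrams("fuzzy"))
print(cosine_similarity("night", "nacht"))
```

The pair "ca" and "ac" shows the practical difference between the two edit distances: plain Levenshtein needs two substitutions, while allowing a transposition of adjacent characters reduces the cost to one.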