FUZZY MATCHING ALGORITHMS EXPLAINED
What is Fuzzy Matching?
Fuzzy matching is a technique for finding strings in one dataset that approximately match strings in another, rather than requiring exact matches. Also known as fuzzy string matching or approximate string matching, it is essential for cleaning and reconciling real-world data, which is often inconsistent or non-standardised. Many fuzzy matching tools report similarity as a percentage, with 0% indicating no similarity and 100% indicating an exact match; distance-based algorithms such as Levenshtein instead return a count of edits, which can be normalised into a percentage.
What is a Similarity Threshold in Fuzzy Matching?
A similarity threshold defines the minimum acceptable similarity between two strings. For example, a threshold of 0.85 means compared entries must be at least 85% similar to be considered a match. Lower thresholds allow more variation and so admit more false positives, while higher thresholds demand closer matches and so miss more genuine ones. Choosing the right threshold is crucial for balancing false positives and false negatives.
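As a rough illustration of how a threshold turns a score into a match decision, here is a minimal sketch using `difflib.SequenceMatcher` from Python's standard library. The helper names `similarity` and `is_match` are hypothetical, and `SequenceMatcher` is only one of many possible scoring methods; it is not the algorithm any particular tool uses.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score between 0.0 and 1.0 (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two strings as a match when their score meets the threshold."""
    return similarity(a, b) >= threshold


# Illustrative pairs: one near-duplicate and one unrelated pair.
pairs = [("Jonathan Smith", "Jonathon Smith"), ("Acme Limited", "Apex Holdings")]
for left, right in pairs:
    print(left, "|", right, "|", round(similarity(left, right), 2), is_match(left, right))
```

Raising the `threshold` argument makes the same pair harder to accept, which is the trade-off between false positives and false negatives described above.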
Why Use Fuzzy Matching Software?
Real-world data is rarely uniform due to diverse data collection and entry methods. Fuzzy matching software helps identify and rectify text-based discrepancies, such as spelling variations and formatting differences, reducing manual cleaning effort. A robust fuzzy matching tool streamlines data processing, improves efficiency, and allows both business and technical users to focus on higher-value tasks.
What Can Fuzzy Matching Software Do?
- Record linkage: Link closely related records across multiple data sources for a unified view of each entity.
- Data deduplication: Merge duplicate records within large datasets, reducing redundancy and improving accuracy.
- Spelling variation analysis: Detect and correct spelling errors or typos, ensuring precise search and analysis.
- Data standardisation: Match records with abbreviations or acronyms (e.g., "Limited" with "Ltd") for uniformity.
- Data integration: Consolidate data from diverse sources into a single, clean platform.
- Name variation matching: Manage variations in names, titles, or prefixes for accurate profiling and communication.
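Two of the tasks above, data standardisation and deduplication, can be sketched together. The example below is a simplified illustration under stated assumptions: the `ABBREVIATIONS` map, the `normalise` and `deduplicate` helpers, and the 0.9 threshold are all hypothetical, and the greedy pairwise comparison would be too slow for large datasets.

```python
from difflib import SequenceMatcher

# Hypothetical abbreviation map; a real one would be built for the dataset.
ABBREVIATIONS = {"ltd": "limited", "inc": "incorporated", "co": "company"}


def normalise(name: str) -> str:
    """Lower-case, replace punctuation with spaces, expand known abbreviations."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(ABBREVIATIONS.get(word, word) for word in cleaned.split())


def deduplicate(names: list[str], threshold: float = 0.9) -> list[str]:
    """Keep the first record of each group of near-duplicates (greedy, O(n^2))."""
    kept: list[str] = []
    for name in names:
        key = normalise(name)
        is_duplicate = any(
            SequenceMatcher(None, key, normalise(other)).ratio() >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(name)
    return kept


records = ["Acme Limited", "ACME Ltd.", "Acme Limted", "Globex Inc", "Globex Incorporated"]
print(deduplicate(records))
```

Standardising first means that "Ltd" and "Limited" compare as identical, so the fuzzy comparison only has to absorb genuine typos such as "Limted".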
Minimising the Impact of False Positives
- Set a fuzzy match threshold: Choose a threshold that balances false positives and false negatives for your dataset.
- Refine your lookup criteria: Use multiple data points (e.g., names and addresses) for more robust matching.
- Expert review: Have a domain expert review match results to fine-tune your approach.
- Quality over quantity: Ensure your main dataset is clean and current for the best results.
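The advice to use multiple data points can be illustrated with a weighted score across fields. This is a minimal sketch, assuming records are plain dictionaries; the field names, the `WEIGHTS` values, and the helper functions are hypothetical and would need tuning for a real dataset.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity of two field values, ignoring case and surrounding spaces."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def record_similarity(left: dict, right: dict, weights: dict) -> float:
    """Weighted average of per-field similarities."""
    total = sum(weights.values())
    weighted = sum(w * field_similarity(left[f], right[f]) for f, w in weights.items())
    return weighted / total


WEIGHTS = {"name": 0.6, "address": 0.4}  # assumed weights, for illustration only

a = {"name": "John Smith", "address": "12 High Street, Leeds"}
b = {"name": "Jon Smith", "address": "12 High St, Leeds"}
c = {"name": "John Smith", "address": "98 Park Avenue, Bristol"}

print(round(record_similarity(a, b, WEIGHTS), 2))  # same person, small variations
print(round(record_similarity(a, c, WEIGHTS), 2))  # same name, different address
```

Matching on the name alone would treat records `a` and `c` as identical; adding the address lowers their combined score and makes that false positive easier to reject.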
Fuzzy Matching in Action: A Real-World Example
Fuzzy matching can be used for record linkage to detect fraud or inconsistencies. For example, in 2005, U.S. government agencies matched pilot licence records with disability payment records, discovering that some pilots were fraudulently claiming benefits while flying. This led to criminal charges and licence suspensions, demonstrating the power of fuzzy matching in real-world data reconciliation.
Popular Fuzzy Matching Algorithms
- Cosine Similarity: Measures similarity by representing strings as vectors (for example, counts of words or character n-grams) and calculating the cosine of the angle between them. For such non-negative vectors the score ranges from 0 to 1.
- Levenshtein Distance: Calculates the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another. A smaller distance means the strings are more similar.
- Peregrine: Flookup's proprietary algorithm, developed by Andrew Apell, calculates percentage similarity between unique substrings in two text entries.
- Damerau-Levenshtein Distance: Like Levenshtein, but also allows transpositions of adjacent characters.
- n-gram: Compares contiguous sequences of n items (syllables, letters, words, etc.) from text entries.
- Soundex: Indexes words by sound as pronounced in English. Flookup uses a refined version for sound similarity matching.
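- The Human Brain: Readers can often recognise words whose interior letters are jumbled, provided the first and last letters stay in place. The claim is popularly attributed to research at Cambridge University, although that attribution comes from a widely circulated internet text rather than a published Cambridge study, and the effect depends on the word and on how its letters are rearranged.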
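Several of the publicly documented algorithms in the list above are short enough to sketch directly. The following is an illustrative, unoptimised sketch in plain Python, not the implementation used by any particular tool. Two assumptions are worth flagging: `osa_distance` implements the optimal string alignment variant, a restricted form of Damerau-Levenshtein, and `cosine_similarity` builds its vectors from character bigrams, which is only one possible choice. Peregrine is proprietary and is not reproduced here, and `soundex` follows the classic American Soundex rules rather than any refined version.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def osa_distance(a: str, b: str) -> int:
    """Levenshtein plus adjacent transpositions (optimal string alignment)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


def ngrams(text: str, n: int = 2) -> Counter:
    """Count the contiguous character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a: str, b: str, n: int = 2) -> float:
    """Cosine of the angle between two n-gram count vectors (0 to 1)."""
    va, vb = ngrams(a, n), ngrams(b, n)
    dot = sum(va[g] * vb[g] for g in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Classic Soundex digit for each consonant group.
CODES = {c: str(d)
         for d, group in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1)
         for c in group}


def soundex(word: str) -> str:
    """Classic four-character Soundex code, e.g. a letter followed by three digits."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    last = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != last:
            result += code
        if c not in "hw":  # h and w do not separate letters with the same code
            last = code
    return (result + "000")[:4]


print(levenshtein("kitten", "sitting"))
print(levenshtein("form", "from"), osa_distance("form", "from"))
print(cosine_similarity("night", "nacht"))
print(soundex("Robert"), soundex("Rupert"))
```

The "form" and "from" pair shows the difference between the two edit distances: plain Levenshtein needs two substitutions, whereas allowing a transposition of adjacent characters needs only one edit.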