Introduction to Normalizing Data

The power of a fuzzy matching algorithm lies in its ability to process data despite any text-based differences like punctuation marks, spelling differences and diacritical marks. Despite the fact that fuzzy matching algorithms can still process text with the aforementioned difficulties, leaving text unstandardized will disproportionately skew results. Thankfully, Flookup has a function to help you standardize the text before attempting to merge data or any other data cleaning operation.

Removing Punctuation Marks

A spreadsheet section showing three columns containing text entries

Let us assume that you have data with punctuation marks and let us assume that you want to retrieve data from a sperate table with clean or standardized data.

A spreadsheet section showing one populated column next to two empty columns to their right
A spreadsheet section showing two populated columns next to one empty column to their right

In cell B2, enter this formula to strip the punctuation marks from Column A: =NORMALIZE($A$2:$A$10,E2)

Your setup will look like this after the punctation marks have been removed and now you can use the values in Column B to merge data from the second table.

Removing Unwanted Words

Let us assume that you have a group of words that repeat throughout your dataset and that those words give you unexpectedly high level of similarity scores like the one shown here: =FUZZYMATCH("Toy Company", "Coyote Company") > 0.77

In the example above, the word "Company" appears in both text entries and contributes to a high percentage similarity despite the fact that these two entries are not really that similar. If we were to remove the offending text, this is the result we would get: =FUZZYMATCH("Toy", "Coyote") > 0.33

The level of similarity is now much more acceptable and expected. Now let us examine how to remove such unwanted words like this using the NORMALIZE function.

A spreadsheet section showing every other column populate with text

Our aim is to merge data to the table shown here, but it is clear that we have a few words like "Inc, Co, etc.", that repeat throughout the dataset and yet they are unlikely to add value to the overall data cleaning process.

In order do this, we shall use the NORMALIZE function by following the steps shown below.

A spreadsheet section showing three populated columns with one empty column to the right