STANDARDIZING DATA IN USING FLOOKUP

Brief Introduction

The power of a fuzzy matching algorithm lies in its ability to process data despite the presence of any text-based differences like punctuation marks, spelling differences and diacritical marks. However, even with such an advantage, leaving datasets unstandardized will disproportionately skew results.

To make this advantage count, let us look at two ways you can use Flookup to standardize your data before attempting to merge data or any other data cleaning operation. Images used are for illustration purposes only.

Removing Punctuation Marks

Let us assume that you have data with punctuation marks and let us assume that you want to retrieve data from a sperate table with clean or standardized data.

A spreadsheet section showing three columns containing text entries

Here are the recommended steps to follow:

A spreadsheet section showing one populated column next to two empty columns to their right

In cell B2, enter this formula to strip the punctuation marks from Column A:

=NORMALIZE($A$2:$A$10,E2)

Your setup will look like this after the punctation marks have been removed and now you can use the values in Column B to merge data from the second table.

A spreadsheet section showing two populated columns next to one empty column to their right

Removing Unwanted Words

Let us assume that you have a group of words that repeat throughout your dataset and that those words give you unexpectedly high percentage similarity scores like the one shown below:

=FUZZYMATCH("Toy Company", "Coyote Company") > 0.77

In the example above, the word "Company" appears in both text entries and contributes to a high percentage similarity despite the fact that these two entries are not really that similar. If we were to remove the offending text, this is the result we would get:

=FUZZYMATCH("Toy", "Coyote") > 0.33

The level of similarity is now much more acceptable and expected.

Now let us examine how to remove unwanted words like this using the NORMALIZE function. Here are the steps to follow:

A spreadsheet section showing every other column populate with text

Our aim is to merge data to the table above, but it is clear that we have a few words like "Inc, Co, etc.", that repeat throughout the dataset and yet they are unlikely to add value to the overall data cleaning process. In other words, it is in our interest to remove them from the columns we intend to analyse.

To do this, we shall use the NORMALIZE function as shown below:

Please note that you might have to repeat this action on the column with values you want to match in the second table. Doing this will allow us to compare both tables without the bias of stop words. Your formula should look like this:

=NORMALIZE($B$1:$B$10,$E$2)

This is what your setup will look like at the end of the normalization process:

A spreadsheet section showing three populated columns with one empty column to the right

If you have a large dataset, use NORMALIZE LRM. LRM has two key distinguishing features:

Finally, use the data from the newly created columns to continue your data cleaning work using merge functions like FLOOKUP and SOUNDMATCH.

TIP: If you use FLOOKUP on the standardized data, you can do the following to restore your data to its pre-sanitized form:

This formula assumes that the row indices returned by Flookup are "2, 5, 7, 15" and that the pre-sanitized data spans Column C to Column E. Finally, it returns data from Column D, as signified by the number "2".