STANDARDIZING DATA IN USING FLOOKUP
Brief Introduction
The power of a fuzzy matching algorithm lies in its ability to process data despite the presence of any text-based differences like punctuation marks, spelling differences and diacritical marks. However, even with such an advantage, leaving datasets unstandardized will disproportionately skew results.
To make this advantage count, let us look at two ways you can use Flookup to standardize your data before attempting to merge data or any other data cleaning operation. Images used are for illustration purposes only.
Removing Punctuation Marks
Let us assume that you have data with punctuation marks and let us assume that you want to retrieve data from a sperate table with clean or standardized data.
Here are the recommended steps to follow:
Create a new column next to the column you want to analyse. In this case, we will insert a new column next to Column A.
On the side, enter the punctuation marks you want to target in a single cell. In our example, we will use Cell E2.
In cell B2, enter this formula to strip the punctuation marks from Column A:
=NORMALIZE($A$2:$A$10,E2)
Your setup will look like this after the punctation marks have been removed and now you can use the values in Column B to merge data from the second table.
Removing Unwanted Words
Let us assume that you have a group of words that repeat throughout your dataset and that those words give you unexpectedly high percentage similarity scores like the one shown below:
=FUZZYMATCH("Toy Company", "Coyote Company") > 0.77
In the example above, the word "Company" appears in both text entries and contributes to a high percentage similarity despite the fact that these two entries are not really that similar. If we were to remove the offending text, this is the result we would get:
=FUZZYMATCH("Toy", "Coyote") > 0.33
The level of similarity is now much more acceptable and expected.
Now let us examine how to remove unwanted words like this using the NORMALIZE function. Here are the steps to follow:
Our aim is to merge data to the table above, but it is clear that we have a few words like "Inc, Co, etc.", that repeat throughout the dataset and yet they are unlikely to add value to the overall data cleaning process. In other words, it is in our interest to remove them from the columns we intend to analyse.
To do this, we shall use the NORMALIZE function as shown below:
First, we create a new column next to Column A, the column with the key values we want to use in our analysis.
Next, we enter comma separated words into any cell on the side. In this example, we shall enter them in Cell E2.
Finally, we strip the unwanted words from Column A and populate Column B with the normalized text entries.
Please note that you might have to repeat this action on the column with values you want to match in the second table. Doing this will allow us to compare both tables without the bias of stop words. Your formula should look like this:
=NORMALIZE($B$1:$B$10,$E$2)
This is what your setup will look like at the end of the normalization process:
If you have a large dataset, use NORMALIZE LRM. LRM has two key distinguishing features:
It is operated via a GUI instead of typing the formulas inside a cell.
Each instance runs for 6 minutes instead of 30 seconds. This means that it has the potential to process large amounts of data every time it is executed.
Finally, use the data from the newly created columns to continue your data cleaning work using merge functions like FLOOKUP and SOUNDMATCH.
TIP: If you use FLOOKUP on the standardized data, you can do the following to restore your data to its pre-sanitized form:
Use FLOOKUP LRM to return the match plus its respective row index number.
In a new column, enter the following function to work through those row indices and return the corresponding data: =ARRAYFORMULA(VLOOKUP({2;5;7;15},$C$1:$E$500,2,FALSE))
This formula assumes that the row indices returned by Flookup are "2, 5, 7, 15" and that the pre-sanitized data spans Column C to Column E. Finally, it returns data from Column D, as signified by the number "2".