STANDARDIZING DATA USING FLOOKUP
The power of a fuzzy matching algorithm lies in its ability to process data despite text-based differences like punctuation marks, spelling variations and diacritical marks. Even though fuzzy matching algorithms can still process text with these difficulties, leaving it unstandardized will disproportionately skew your results. Thankfully, Flookup includes a function to help you standardize text before attempting to merge data or perform any other data cleaning operation.
Removing Punctuation Marks
Let us assume that you have data with punctuation marks and that you want to retrieve data from a separate table with clean or standardized data.
Here are the recommended steps to follow:
Create a new column next to the column you want to analyse. In this case, we will insert a new column next to Column A.
On the side, enter the punctuation marks you want to target in a single cell. In our example, we will use Cell E2.
Your setup should look like this at this point:
In cell B2, enter this formula to strip the punctuation marks from Column A:
Your setup will look like this after the punctuation marks have been removed. You can now use the values in Column B to merge data from the second table.
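The exact formula lives in the Flookup add-on, but the idea behind the helper column can be sketched in Python. This is an illustrative stand-in, not Flookup's implementation; the `strip_punctuation` name and the sample values are hypothetical.

```python
import string

def strip_punctuation(values, marks=string.punctuation):
    """Remove the targeted punctuation marks from each entry,
    mirroring what the helper column (Column B) would hold."""
    table = str.maketrans("", "", marks)
    return [v.translate(table) for v in values]

# Hypothetical sample data with stray punctuation
names = ["Acme Co.", "O'Brien & Sons, Inc."]
print(strip_punctuation(names))  # ['Acme Co', 'OBrien  Sons Inc']
```

In the spreadsheet, the marks to target sit in a single cell (E2 in the example above); here they are simply passed in as the `marks` string.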
Removing Unwanted Words
Let us assume that you have a group of words that repeat throughout your dataset and give you unexpectedly high percentage similarity scores, like the one shown below:
=FUZZYMATCH("Toy Company", "Coyote Company") → 0.77
In the example above, the word "Company" appears in both text entries and inflates the percentage similarity, even though the two entries are not really that similar. If we were to remove the offending text, this is the result we would get:
=FUZZYMATCH("Toy", "Coyote") → 0.33
The level of similarity is now much more acceptable and expected.
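Flookup's FUZZYMATCH algorithm is internal to the add-on, so the effect can only be approximated here, but Python's `difflib` produces a comparable pattern: a shared word inflates the score, and removing it deflates the score. The exact numbers differ from Flookup's (0.77 and 0.33 above); this is a rough stand-in, not the actual algorithm.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Rough stand-in for a 0-to-1 fuzzy similarity score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(similarity("Toy Company", "Coyote Company"))  # 0.8  - inflated by the shared word
print(similarity("Toy", "Coyote"))                  # 0.44 - a more honest score
```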
Now let us examine how to remove unwanted words like this using the NORMALIZE function. Here are the steps to follow:
Our aim is to merge data into the table above, but it is clear that a few words, like "Inc" and "Co", repeat throughout the dataset and yet are unlikely to add value to the overall data cleaning process.
In other words, it is in our interest to remove them from the columns we intend to analyse.
To do this, we shall use the NORMALIZE function as shown below:
First, we create a new column next to Column A, the column with the key values we want to use in our analysis.
Next, we enter comma separated words into any cell on the side. In this example, we shall enter them in cell E2.
Finally, we strip the unwanted words from Column A and populate Column B with the normalized text entries.
Please note that you might have to repeat this action on the column with the values you want to match in the second table. Doing this will allow you to compare both tables without the bias of stop words. Your formula should look like this: =NORMALIZE($A$2:$A$10,$E$2)
This is what your setup will look like at the end of the normalization process:
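The steps above can be sketched in Python. The sketch assumes NORMALIZE's core behaviour is stripping a comma-separated list of stop words from each entry; the `normalize` helper and sample data below are illustrative, not Flookup's implementation.

```python
def normalize(values, stop_words):
    """Strip the comma-separated stop words (like the list held in
    cell E2) from each entry, ignoring case and trailing punctuation."""
    stops = {w.strip().lower() for w in stop_words.split(",")}
    cleaned = []
    for value in values:
        kept = [w for w in value.split() if w.strip(".,").lower() not in stops]
        cleaned.append(" ".join(kept))
    return cleaned

# Hypothetical Column A values and the E2 stop-word list
print(normalize(["Toy Company Inc", "Coyote Trading Co"], "Inc, Co, Company"))
# ['Toy', 'Coyote Trading']
```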
If you have a large dataset, use NORMALIZE LRM. LRM has two key distinguishing features:
It is operated via a GUI instead of by typing formulas into a cell.
Each instance runs for 6 minutes instead of 30 seconds. This means that it has the potential to process large amounts of data every time it is executed.
Finally, use the data from the newly created columns to continue your data cleaning work using lookup/merge functions like FLOOKUP and SOUNDMATCH.
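To show where the normalized columns fit in, here is a minimal sketch of the lookup step itself: for each key, find the most similar candidate in the second table and accept it only above a similarity threshold. FLOOKUP's own matching logic is internal to the add-on; this uses `difflib` as a stand-in, and the `fuzzy_lookup` name, threshold and sample data are assumptions.

```python
from difflib import SequenceMatcher

def fuzzy_lookup(key, candidates, threshold=0.7):
    """Return the candidate most similar to `key`, or None if the
    best score falls below `threshold` - the core idea of a fuzzy lookup."""
    def score(c):
        return SequenceMatcher(None, key.lower(), c.lower()).ratio()
    best = max(candidates, key=score)
    return best if score(best) >= threshold else None

# Hypothetical normalized values from the second table
clean = ["Coyote Trading", "Toy Factory"]
print(fuzzy_lookup("Toy Factry", clean))  # 'Toy Factory'
```

Running this on both normalized columns, instead of the raw ones, is what removes the stop-word bias described above.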