NORMALIZING DATA IN GOOGLE SHEETS FOR BETTER DATA ANALYSIS
Introduction to Normalizing Data
The power of a fuzzy matching algorithm lies in its ability to process data despite any text-based differences like punctuation marks, spelling differences and diacritical marks. Despite the fact that fuzzy matching algorithms can still process text with the aforementioned difficulties, leaving text unstandardized will disproportionately skew results. Thankfully, Flookup has a function to help you standardize the text before attempting to merge data or any other data cleaning operation.
Removing Punctuation Marks
Let us assume that you have data with punctuation marks and let us assume that you want to retrieve data from a sperate table with clean or standardized data.
Create a new column next to the column you want to analyse. In this case, we will insert a new column next to Column A.
On the side, enter the punctuation marks you want to target in a single cell. In our example, we will use Cell E2.
In cell B2, enter this formula to strip the punctuation marks from Column A: =NORMALIZE($A$2:$A$10,E2)
Your setup will look like this after the punctation marks have been removed and now you can use the values in Column B to merge data from the second table.
Removing Unwanted Words
Let us assume that you have a group of words that repeat throughout your dataset and that those words give you unexpectedly high level of similarity scores like the one shown here: =FUZZYMATCH("Toy Company", "Coyote Company") > 0.77
In the example above, the word "Company" appears in both text entries and contributes to a high percentage similarity despite the fact that these two entries are not really that similar. If we were to remove the offending text, this is the result we would get: =FUZZYMATCH("Toy", "Coyote") > 0.33
The level of similarity is now much more acceptable and expected. Now let us examine how to remove such unwanted words like this using the NORMALIZE function.
Our aim is to merge data to the table shown here, but it is clear that we have a few words like "Inc, Co, etc.", that repeat throughout the dataset and yet they are unlikely to add value to the overall data cleaning process.
In order do this, we shall use the NORMALIZE function by following the steps shown below.
Create a new column next to column A, the column with the key values we want to use in our analysis.
Enter comma separated words into any cell on the side. In this example, we shall enter them in cell E2.
Populate column B with normalized versions of text entries retrieved from column A by using this formula: =NORMALIZE($B$1:$B$10,$E$2)
---
This image shows what your spreadsheet setup will look like at the end of the normalization process.