PREPROCESS DATA BY TEXT SIMILARITY
Introduction to Data Preprocessing
In this section, you will learn how to use two powerful Flookup functions that can make your data cleaning easier, faster and prone to fewer errors: NORMALIZE and FUZZYMATCH.
NORMALIZE can improve the quality and consistency of your data by removing or formatting text entries that might interfere with the fuzzy matching process.
FUZZYMATCH can help you understand your data better by showing you how similar your text entries are. It also gives you a glimpse of the underlying mechanism that drives the other Flookup functions.
Normalize function modes can be divided into two broad categories:
CATEGORY #1: Remove punctuation marks, Remove unwanted words. These will preserve the original dataset.
CATEGORY #2: Remove diacritical marks, Keep URL domain only, Keep URL path only. These will modify the original dataset.
To normalize text entries in the first category of functions, follow the steps below:
Head to Extensions > Flookup Data Wrangler > Match and Merge Functions > NORMALIZE in your spreadsheet menu.
Select the mode you want this function to run from the drop-down menu [See Explanation Below].
Range of entries to normalize: Select a column of entries you want to normalise and then click "Grab selected range".
Range of entries to be removed: If running Category #1 functions, select a single cell that contains your list of unwanted words or punctuation marks and then click "Grab selected range".
Threshold: [Optional] This parameter is only active when removing unwanted words. It defines the minimum level of similarity between words to be removed from each text entry in your dataset and the predefined list of unwanted words.
If running Category #1 functions, click an empty cell where you would like your results to be displayed. This cell should be in any sheet inside that same workbook.
Click "Normalize text entries".
NORMALIZE Modes Explained
Remove diacritical marks: This removes all diacritical marks from text entries.
Remove punctuation marks: This will remove all specified punctuation marks from the target text entries.
Remove unwanted words: This will remove all specified words from the target text entries.
Keep URL domain only: This strips a URL of everything except its root domain e.g. "www.example.com" to "example.com".
Keep URL path only: This strips a URL of everything except its path e.g. "www.example.com/new/users" to "/new/users".
Key Points on NORMALIZE
When removing punctuation marks, Range of entries to be removed should take a cell with values that are not separated by any delimiter e.g. "?,.@/".
When removing unwanted words, Range of entries to be removed should take a cell will comma separated values e.g. "company, limited, Mr, Mrs".
NORMALIZE will normalize the leftmost column of that range if you select more than one column. To normalize text entries from any column in a range, you need to select a single column.
When NORMALIZE times out, the results that have been processed up to that point will be displayed.
Head to Extensions > Flookup Data Wrangler > Match and Merge Functions > FUZZYMATCH in your spreadsheet menu.
Primary range: Select range of one or more columns, containing one or more rows and click "Grab selected range".
Secondary range: Select a separate range of one or more columns and then click "Grab selected range".
Left_column: Enter the index of the column of values in "Primary range" that you want to compare to values from "Secondary range".
Right_column: Enter the index of the column of values in "Secondary range" that you want to compare to values from "Primary range".
Click an empty cell in a column to mark where you would like your results to be displayed, inside that same workbook.
Click "Calculate percentage similarities".
Key Points on FUZZYMATCH
FUZZYMATCH can compare one text entry to multiple rows e.g. A1 to A2:B2 or text entries that are on the same row e.g. A1 to B1, A2 to B2, etc. and "Primary range" must not be longer than "Secondary range".
When FUZZYMATCH times out, the results that have been processed up to that point will be displayed.