PREPROCESS DATA BY TEXT SIMILARITY
Introduction to Data Preprocessing
In this guide, you will learn how to use two powerful Flookup functions that can make your data cleaning easier, faster and susceptible to fewer errors: NORMALIZE and FUZZYMATCH.
NORMALIZE improves the quality and consistency of your data by removing or formatting text entries that might interfere with the fuzzy matching process.
FUZZYMATCH helps you understand your data better by showing you how similar your text entries are. It also gives you a glimpse of the underlying mechanism that drives the other Flookup functions.
NORMALIZE
This function can either modify the original dataset in place or leave the original dataset unchanged. Here is a condensed look at what each function mode does:
Remove punctuation marks: This will remove all specified punctuation marks from the selected text entries.
Remove unwanted words: This will remove all specified words from the selected text entries.
Remove diacritical marks: This removes all diacritical marks from text entries.
Keep URL domain only: This strips a URL of everything except its root domain e.g. "www.example.com" to "example".
Keep URL path only: This strips a URL of everything except its path e.g. "www.example.com/new/users" to "/new/users".
-----
To normalize text entries by removing punctuation marks or removing unwanted words, follow the steps below:
Open the function sidebar
Go to Extensions > Flookup Data Wrangler > Lookup and merge > NORMALIZE in your spreadsheet menu.
Select the mode to run
Choose the mode for the function from the drop-down menu.
Select the entries to normalize
Highlight the column of text entries to normalize and click "Grab selected range".
Select the entries to remove
Highlight the cell containing stop words (comma separated) or punctuation marks (no separation) and click "Grab selected range".
Specify the result location
Click an empty cell to indicate where the results should be displayed. This cell should be in any sheet inside that same workbook.
Normalize the dataset
Click the "Normalize text entries" button.
-----
To normalize text entries by removing diacritical marks, keeping URL domain or keeping the URL path, follow the steps below:
Open the function sidebar
Go to Extensions > Flookup Data Wrangler > Lookup and merge > NORMALIZE in your spreadsheet menu.
Select the mode to run
Choose the mode for the function from the drop-down menu.
Select the entries to normalize
Highlight the column of entries to normalize and click "Grab selected range".
Normalize the text entries
Click the "Normalize text entries" button.
-----
Notes On Normalizing Data
The Threshold parameter is only active when removing unwanted words and you should adjust it at step #5 above. It defines the minimum level of similarity between words to be removed from each text entry in your dataset and the predefined list of unwanted words.
When removing punctuation marks, Range of entries to be removed should contain values that are not separated by any delimiter e.g. "?,.@/".
When removing unwanted words, Range of entries to be removed should contain comma separated values e.g. "company, limited, Mr, Mrs".
NORMALIZE will process data in the leftmost column of that range if you select more than one column. To normalize text entries from any column within a range, you need to select a single column.
-----
NORMALIZE Custom Function
To use the custom function variant of NORMALIZE e.g. =NORMALIZE(A1:A100), click here.
FUZZYMATCH
Open the sidebar
Go to Extensions > Flookup Data Wrangler > Lookup and merge > FUZZYMATCH in your spreadsheet menu.
Select the mode to run
Choose the mode for the function from the drop-down menu:
Compare short text entries
Compare long text entries
Select the Primary Range
Highlight the range of one or more columns containing multiple cells and click "Grab selected range".
Select the Secondary Range
Highlight a separate range of one or more columns and click "Grab selected range".
Specify the columns to compare
Left_column: Enter the index of the column in "Primary range" to compare.
Right_column: Enter the index of the column in "Secondary range" to compare.
Specify the output location
Click an empty cell to mark the column where you would like your results to be displayed, inside that workbook.
Calculate similarities
Click the "Calculate percentage similarities" button.
-----
Notes On Comparing Text for Similarity
FUZZYMATCH can compare one text entry to multiple rows e.g. A1 to A2:A50 or text entries that are on the same row e.g. A1 to B1, A2 to B2, etc. but "Primary range" must not be longer than "Secondary range".
Examples of short text entries are names and phrases. Long text entries are sentences, paragraphs, etc.
-----
FUZZYMATCH Custom Function
To use the custom function variant of FUZZYMATCH e.g. =FUZZYMATCH(A1:A100, B1:B100), click here.