PREPROCESS DATA BY TEXT SIMILARITY
Introduction to Data Preprocessing
In this guide, you will learn how to use two powerful Flookup functions that can make your data cleaning easier, faster and susceptible to fewer errors: NORMALIZE and FUZZYMATCH.
NORMALIZE can improve the quality and consistency of your data by removing or formatting text entries that might interfere with the fuzzy matching process.
FUZZYMATCH can help you understand your data better by showing you how similar your text entries are. It also gives you a glimpse of the underlying mechanism that drives the other Flookup functions.
NORMALIZE
Normalize function modes can be divided into two broad groups:
Group I: Remove diacritical marks, Keep URL domain only, Keep URL path only. These will modify the original dataset in place.
Group II: Remove punctuation marks, Remove unwanted words. These will leave the original dataset unchanged.
Here is a condensed look at what each NORMALIZE Mode does:
Remove diacritical marks: This removes all diacritical marks from text entries.
Remove punctuation marks: This will remove all specified punctuation marks from the selected text entries.
Remove unwanted words: This will remove all specified words from the selected text entries.
Keep URL domain only: This strips a URL of everything except its root domain e.g. "www.example.com" to "example".
Keep URL path only: This strips a URL of everything except its path e.g. "www.example.com/new/users" to "/new/users".
To normalize text entries in the "Group I" functions, follow the steps below:
Head to Extensions > Flookup Data Wrangler > Preprocess data > NORMALIZE in your spreadsheet menu.
Select the mode you want this function to run from the drop-down menu. See the explanation below.
Range of entries to normalize: Select a column of entries you want to normalise and then click the "Grab selected range" button.
Click the "Normalize text entries" button.
To normalize text entries in the "Group II" functions, follow the steps below:
Head to Extensions > Flookup Data Wrangler > Preprocess data > NORMALIZE in your spreadsheet menu.
Select the mode you want this function to run from the drop-down menu. Scroll down to see the explanation below.
Range of entries to normalize: Select a column of text entries you would like to normalise and then click the "Grab selected range" button.
Range of entries to be removed: Select a cell that contains stop words or punctuation marks and then click the "Grab selected range" button.
Click an empty cell to indicate the column where you would like your results to be displayed. This cell should be in any sheet inside that same workbook.
Click the "Normalize text entries" button.
---
Threshold: This parameter is only active when removing unwanted words and you should adjust it at step #5 above. It defines the minimum level of similarity between words to be removed from each text entry in your dataset and the predefined list of unwanted words.
Key Points on NORMALIZE
When removing punctuation marks, Range of entries to be removed should take a cell with values that are not separated by any delimiter e.g. "?,.@/".
When removing unwanted words, Range of entries to be removed should take a cell will comma separated values e.g. "company, limited, Mr, Mrs".
NORMALIZE will normalize the leftmost column of that range if you select more than one column. To normalize text entries from any column in a range, you need to select a single column.
When NORMALIZE times out, the results that have been processed up to that point will be displayed.
FUZZYMATCH
Head to Extensions > Flookup Data Wrangler > Preprocess data > FUZZYMATCH in your spreadsheet menu.
Select the mode you want this function to run from the drop-down menu. Your choices are: Compare short text entries and Compare long text entries.
Primary range: Select range of one or more columns containing multiple cells and click the "Grab selected range" button.
Secondary range: Select a separate range of one or more columns and then click the "Grab selected range" button.
Left_column: Enter the index of the column of values in "Primary range" that you want to compare to values from "Secondary range".
Right_column: Enter the index of the column of values in "Secondary range" that you want to compare to values from "Primary range".
Click an empty cell to mark the column where you would like your results to be displayed, inside that workbook.
Click the "Calculate percentage similarities" button.
Key Points on FUZZYMATCH
FUZZYMATCH can compare one text entry to multiple rows e.g. A1 to A2:A50 or text entries that are on the same row e.g. A1 to B1, A2 to B2, etc. but "Primary range" must not be longer than "Secondary range".
Examples of short text entries are names and phrases. Long text entries are sentences, paragraphs, etc.
When FUZZYMATCH times out, the results that have been processed up to that point will be displayed.