STANDARDISE DATA BY TEXT SIMILARITY
How to Standardise Datasets
This function can either modify the original dataset in place or leave the original dataset unchanged. Here is a condensed look at what each function mode does:
- Remove punctuation marks: This will remove all specified punctuation marks from the selected text entries.
- Remove unwanted words: This will remove all specified words from the selected text entries.
- Remove diacritical marks: This removes all diacritical marks from text entries.
- Keep URL domain only: This strips a URL of everything except its root domain, e.g. "www.example.com" to "example".
- Keep URL path only: This strips a URL of everything except its path, e.g. "www.example.com/new/users" to "/new/users".
Step-by-Step: Remove Punctuation or Unwanted Words
- Open the function sidebar
Extensions > Flookup Data Wrangler > Transformation functions > Preprocess and standardise in your spreadsheet menu. - Select the mode to run
Choose the mode for the function from the drop-down menu. - Select the entries to normalise
Highlight the column of text entries to normalise and click Grab selected range. - Select the entries to remove
Highlight the cell containing stop words (comma separated) or punctuation marks (no separation) and click Grab selected range. - Specify the result location
Click an empty cell to indicate where the results should be displayed. This cell should be in any sheet inside that same workbook. - Normalise the dataset
Click the Normalise text entries button.
Step-by-Step: Remove Diacritical Marks, Keep URL Domain or Path
- Open the function sidebar
Extensions > Flookup Data Wrangler > Transformation functions > Preprocess and standardise in your spreadsheet menu. - Select the mode to run
Choose the mode for the function from the drop-down menu. - Select the entries to normalise
Highlight the column of entries to normalise and click Grab selected range. - Normalise the text entries
Click the Normalise text entries button.
Notes on Standardising Data
- The Threshold parameter is only active when removing unwanted words and you should adjust it at step #5 above. It defines the minimum level of similarity between words to be removed from each text entry in your dataset and the predefined list of unwanted words.
- When removing punctuation marks, the range of entries to be removed should contain values that are not separated by any delimiter, e.g. ?,.@/.
- When removing unwanted words, the range of entries to be removed should contain comma separated values, e.g. company, limited, Mr, Mrs.
- This function will process data in the leftmost column of that range if you select more than one column. To normalise text entries from any column within a range, you need to select a single column.
For the Visual Learners
Labels might differ slightly but the steps are the same.