STANDARDIZE DATA BY TEXT SIMILARITY
Introduction to Standardizing Data
Unlock the power of clean data with Flookup's "Standardize Data" feature, an essential tool for effective data standardization and text normalization within Google Sheets. Whether you need to preprocess datasets by removing punctuation, eliminating unwanted words, stripping diacritics or normalizing URLs, this function streamlines your Google Sheets data cleaning efforts. It can either modify your original dataset in place or create a new, standardized version, leaving your original data untouched. Here is a condensed look at what each function mode does:
- Remove punctuation marks: This will remove all specified punctuation marks from the selected text entries.
- Remove unwanted words: This will remove all specified words from the selected text entries.
- Remove diacritical marks: This removes all diacritical marks from text entries.
- Keep URL domain only: This strips a URL of everything except its root domain, for example "www.example.com" to "example".
- Keep URL path only: This strips a URL of everything except its path, for example "www.example.com/new/users" to "/new/users".
Step-by-Step Guide: Removing Punctuation or Unwanted Words
-
Open the function sidebar
Extensions > Flookup Data Wrangler > Transformation functions > Preprocess and standardize in your spreadsheet menu. - Select the mode to run
Choose the mode for the function from the drop down menu. - Select the entries to standardize
Highlight the column of text entries to standardize and click Grab selected range. -
Select the entries to remove
Highlight the cell containing stop words (comma separated) or punctuation marks (no separation) and click Grab selected range. - Specify the result location
Click an empty cell to indicate where the results should be displayed. This cell should be in any sheet inside that same workbook. - Standardize the dataset
Click the Standardize text entries button.
Step-by-Step Guide: Removing Diacritical Marks, Keeping URL Domain or Path
-
Open the function sidebar
Extensions > Flookup Data Wrangler > Transformation functions > Preprocess and standardize in your spreadsheet menu. - Select the mode to run
Choose the mode for the function from the drop down menu. - Select the entries to standardize
Highlight the column of entries to standardize and click Grab selected range. - Standardize the text entries
Click the Standardize text entries button.
Important Notes on Data Standardization
- The Threshold parameter is only active when removing unwanted words and you should adjust it at step #5 above. It defines the minimum level of similarity between words to be removed from each text entry in your dataset and the predefined list of unwanted words.
- When removing punctuation marks, the range of entries to be removed should contain values that are not separated by any delimiter, for example ?,.@/.
- When removing unwanted words, the range of entries to be removed should contain comma separated values, for example company, limited, Mr, Mrs.
- This function will process data in the leftmost column of that range if you select more than one column. To standardize text entries from any column within a range, you need to select a single column.
Visual Learning Resources
Labels might differ slightly but the steps are the same.