HOW TO PREPARE DATA FOR CLEANING

Introduction to Preparing Your Data

In this section, we will introduce two Flookup functions to help you get your data ready for cleaning: NORMALIZE and FUZZYMATCH.

NORMALIZE helps clean and standardise your data by removing text entries that can negatively affect the fuzzy matching process. FUZZYMATCH provides a standardised way to view how closely matched your text entries are and it also provides a good demonstration of the underlying mechanism used by the other Flookup functions.

NORMALIZE

=NORMALIZE(lookupValue, [stopArray], [threshold])

Use the NORMALIZE function to remove unwanted words, diacritics and punctuation marks from text entries.

NORMALIZE Parameters

    • lookupValue [Required]. The text entries to be normalized. This can be a text entry, a single cell reference or a range of cells.

    • stopArray [Optional]. The values to be removed from "lookupValue". This can either be unwanted words [comma separated] or punctuation marks [no delimiter] contained in a single cell. Please do not mix punctuation marks with unwanted words in your "stopArray" input.

    • threshold [Optional]. The minimum percentage similarity between "lookupValue" and "stopArray". The default value is 1.

Using the Long Run Mode

    1. Head to Extensions > Flookup > Long Run Mode (LRM) > NORMALIZE in your spreadsheet menu.

    2. Select the mode you want this function to run. It can be Remove diacritical marks, Remove punctuation marks or Remove unwanted words.

    3. Range of entries to normalize: Select a column of entries you want to normalise and then click "Grab selected range".

    4. Range of entries to be removed: Select a single cell or column that contains your list of unwanted words or punctuation marks and then click "Grab selected range".

    5. Threshold: This optional step and it is only active while removing unwanted words. It defines the minimum level of similarity between the entries to normalize and the unwanted words.

    6. Click an empty cell where you want your results to be displayed.

    7. Click "Normalize text entries".

Key Points

  • When adjusting "Threshold", please use the arrow buttons at the extreme right of the respective input field.

  • NORMALIZE will create a new list without modifying the original list.

  • The "stopArray" parameter is optional. If you do not include it in your formula, the function will default to removing diacritics.

  • NORMALIZE LRM will normalize the leftmost column of that range if you select more than one column. To normalize text entries from any column in a range, you need to select a single column.

  • When NORMALIZE LRM times out, the results that have been processed up to that point will be displayed.

FUZZYMATCH

=FUZZYMATCH(string1, string2)

Use the FUZZYMATCH function to calculate the percentage similarity between text entries and return the result in decimal form.

FUZZYMATCH Parameters

    • string1 [Required]. The primary text entry to be compared to "string2".

    • string2 [Required]. The secondary text entry to be compared to "string1".

Using the Long Run Mode

    1. Head to Extensions > Flookup > Long Run Mode (LRM) > FUZZYMATCH in your spreadsheet menu.

    2. Primary range: Select range of one or more columns and click "Grab selected range".

    3. Secondary range: Select a separate range of one or more columns and then click "Grab selected range".

    4. Index One: Enter the index of the column of values in "Primary range" that you want to compare to values from "Secondary range".

    5. Index Two: Enter the index of the column of values in "Secondary range" that you want to compare to values from "Primary range".

    6. Click an empty cell in a column where you want your results to be displayed.

    7. Click "Calculate percentage similarities".

Key Points

    • When adjusting "Index One" or "Index Two", please use the arrow buttons at the extreme right of the respective input field.

    • FUZZYMATCH parameters can have interchanged arguments without affecting the result.

    • FUZZYMATCH parameters can take ranges as arguments. However, when using this mode, "string1" must not be longer than "string2".

  • FUZZYMATCH LRM compares text entries that are on the same row e.g. A1 to B1, A2 to B2, etc. and "Primary range" must not be longer than "Secondary range".

  • When FUZZYMATCH LRM times out, the results that have been processed up to that point will be displayed.