HOW TO PREPARE DATA FOR CLEANING

Introduction to Preparing Your Data

In this section, we will introduce two Flookup functions to help you get your data ready for cleaning: NORMALIZE and FUZZYMATCH.

NORMALIZE helps clean and standardise your data by removing text entries that can negatively affect the fuzzy matching process. FUZZYMATCH provides a standardised way to view how closely matched your text entries are and it also provides a good demonstration of the underlying mechanism used by the other Flookup functions.

FLOOKUP FUNCTION TYPES: Flookup has two types of functions. One type is accessible by typing a formula directly inside the spreadsheet cell and the other type is accessible through the spreadsheet menu. The former times out after 30 seconds, while the latter times out after 6 minutes.

NORMALIZE

=NORMALIZE(lookupValue, [stopArray])

Use the NORMALIZE function to remove unwanted words, diacritics and punctuation marks from text entries.

To remove diacritics, feed the "lookupValue" parameter with a single cell or a range of cells, as shown in the examples below:

  • =NORMALIZE(A1)

  • =NORMALIZE(A1:A3000)

To remove unwanted words or punctuation marks, feed the "stopArray" parameter with an array of words or punctuation marks as shown below:

  • =NORMALIZE(A1:A3000, B1:B10)

  • =NORMALIZE(A1:A3000, "company, https, ltd, limited, org")

  • =NORMALIZE(A1:A3000, "-, &, .")

This means that the argument passed to the "stopArray" parameter can be a range, values typed directly inside the formula as shown above or a series of values inside a single cell (space separated for words or no spaces for punctuation marks).

Using the Long Run Mode

    1. Head to Extensions > Flookup > Long Run Mode > NORMALIZE in your spreadsheet menu.

    2. Select the mode you want this function to run. It can either be Remove diacritical marks, Remove punctuation marks or Remove unwanted words.

    3. Range of entries to normalize: Select a column of entries you want to normalise and then click "Get selected range".

    4. Range of entries to be removed: Select a single cell or column that contains your list of unwanted words or punctuation marks and then click "Get selected range".

    5. Click an empty cell where you want your results to be displayed.

    6. Click "Normalize text entries".

Key Points

  • NORMALIZE will create a new list without modifying the original list. It is therefore advised that you use this function in an empty column or sheet.

  • The stopArray parameter is optional and, therefore, if you do not include it in your formula, the function will default to removing diacritics.

  • You can select the unwanted words or punctuation marks contained in a single cell or single column. Please do not mix punctuation marks and unwanted words in your stopArray input.

  • When NORMALIZE LRM times out, the results that have been processed up to that point will be displayed.

  • NORMALIZE LRM will only remove diacritical marks if the "Range of entries to be removed" field is inactive.

  • NORMALIZE LRM will normalize the leftmost column of that range if you select a range of more than one column. To normalize text from any column in a range, you need to select a single column.

FUZZYMATCH

=FUZZYMATCH(string1, string2)

Use the FUZZYMATCH function to calculate the percentage similarity between text entries and return the result in decimal form.

FUZZYMATCH Parameters

    • string1 [Required]. The primary text entry to be compared to string2.

    • string2 [Required]. The secondary text entry to be compared to string1.

Using the Long Run Mode

    1. Head to Extensions > Flookup > Long Run Mode > FUZZYMATCH in your spreadsheet menu.

    2. Primary range: Select range of one or more columns and click "Get selected range".

    3. Secondary range: Select a separate range of one or more columns and then click "Get selected range".

    4. Index One: Enter the index of the column of values in "Primary range" that you want to compare to values from "Secondary range".

    5. Index Two: Enter the index of the column of values in "Secondary range" that you want to compare to values from "Primary range".

    6. Click an empty cell in a column where you want your results to be displayed.

    7. Click "Calculate percentage similarities".

Key Points

    • FUZZYMATCH parameters can have interchanged arguments without affecting the result e.g. =FUZZYMATCH(A1, B1) is the same as =FUZZYMATCH(B1, A1).

    • FUZZYMATCH parameters can take ranges as arguments. It can compare a single cell in one column to all the cells in another column or cells in one column to cells in another column that are on the same relative row e.g. =FUZZYMATCH(A1, B1:B100) or =FUZZYMATCH(A1:A100, B1:B100).

  • When FUZZYMATCH LRM times out, the results that have been processed up to that point will be displayed.

  • FUZZYMATCH LRM will compare cells that are on the same row to one another e.g. A1 to B1, A2 to B2 and so on.

  • When using FUZZYMATCH LRM, "Primary range" cannot be longer than "Secondary range".