HOW TO PREPARE YOUR DATASET_

Optimise the Process

Fuzzy matching algorithms are generally processor intensive. Therefore, to speed up the process in Flookup and improve your overall success rate, try the following:

    • Use NORMALIZE to remove unwanted words or punctuation marks that negatively affect the accuracy of the results, and to reduce the workload.

    • Use the FUZZYMATCH function to get a feel of the scale Flookup uses to grade percentage similarities between text entries. All our lookup functions, except SOUNDMATCH, depend on this function.

    • Set the threshold value (i.e. the level of similarity) to 1 in order to eliminate the "low-hanging fruit" during your first run. You can gradually lower it for your subsequent runs.

    • Enter ranges (e.g. A1:A1000) as the lookup value in order to significantly improve speed and efficiency. Doing this also prevents the spreadsheet from hanging.

    • Take advantage of the Long Run Mode. Custom functions are given exactly 30 seconds to return results. However, in Long Run Mode (LRM), they can run for 6 minutes before timing out. To access this feature, click on the menu items located under Flookup > Long Run Mode.

NORMALIZE

=NORMALIZE(lookupValue, [stopArray])

Use the NORMALIZE function to remove diacritical marks from text entries before applying FUZZYMATCH or any of the lookup functions. We highly recommend that you begin by using this function before running any further analysis on your data.

The lookupValue parameter can either be:

    • A single cell e.g. =NORMALIZE(A1) or,

    • A range of cells e.g. =NORMALIZE(A1:A3000)

You can also use the following function to remove unwanted words or punctuation marks. This is useful if you have words or punctuation marks that you would like Flookup to ignore during processing. To do this, you need to feed the stopArray parameter with an array of words or punctuation marks you want to remove as shown below:

  • =NORMALIZE(A1:A3000, B1:B10)

  • =NORMALIZE(A1:A3000, "company, https, ltd, limited, org")

  • =NORMALIZE(A1:A3000, "-, &, .")

The second argument can be a range containing the unwanted words or punctuation marks (e.g. B1:B10) or a list of directly typed words or punctuation marks.

Note A

  • NORMALIZE will create a new list without modifying the original list. It is therefore advised that you use this function in an empty column or sheet.

  • The stopArray parameter is optional and, therefore, if you do not include it in your formula, the function will default to removing diacritical marks.

Using the Long Run Mode

    1. Range of entries to normalize: Select a column of entries you want to normalise and then click "Get selected range".

    2. Range of entries to be removed: Select a single cell or column that contains your list of stop words or punctuation marks and then click "Get selected range".

    3. Click an empty cell where you want your results to be displayed.

    4. Click "Normalize text entries".

Note B

  • The NORMALIZE LRM has three different modes:

    1. Remove diacritical marks: Flookup removes diacritical marks from text entries found in "Range of values to be removed".

    2. Remove punctuation marks: Flookup removes punctuation marks from text entries found in "Range of values to be removed".

    3. Remove unwanted words: Flookup removes unwanted words from text entries found in "Range of values to be removed".

  • When the NORMALIZE LRM function times out, the results that have been processed up to that point will be displayed.

  • If the "Range of entries to be removed" field is inactive, the function will only remove diacritical marks.

  • To normalize text from any column in a range, you need to select a single column. If you select a range of more than one column, the leftmost column of that range will be normalized.

  • You can select the stop words or punctuation marks in a single cell with space separated values or as a single column of values.

FUZZYMATCH

=FUZZYMATCH(string1, string2)

Use the FUZZYMATCH function to calculate the percentage similarity between text entries and return the result in decimal form.

FUZZYMATCH Parameters

    • string1 [Required]. The primary text entry to be compared to string2.

    • string2 [Required]. The secondary text entry to be compared to string1.

Note A

    • The two arguments can be interchanged without affecting the result.

    • Both parameters can also be ranges e.g. =FUZZYMATCH(A1:A2000, B1:B2000) OR =FUZZYMATCH(A1, B1:B2000). The first formula compares a cell in one column, to a cell in a second column which is in the same row and the second formula compares a single cell in one column to all the cells in a second column.

    • FUZZYMATCH supports multiple languages.

Using the Long Run Mode

    1. Primary range: Select range of one or more columns and click "Get selected range".

    2. Secondary range: Select a separate range of one or more columns and then click "Get selected range".

    3. Target column: Enter the index of the column of values in "Primary range" that you want to compare to values from "Secondary range".

    4. Secondary column: Enter the index of the column of values in "Secondary range" that you want to compare to values from "Primary range".

    5. Click an empty cell in a column where you want your results to be displayed.

    6. Click "Calculate percentage similarities".

Note B

    • When the FUZZYMATCH LRM function times out, the results that have been processed up to that point will be displayed.

    • This mode will compare cells on the same row to one another e.g. A1 to B1, A2 to B2 and so on.

    • The "Primary range" cannot be longer than "Secondary range".

Further Reading

Now that your data is fully prepared, please follow any of the links below to read more of our documentation: