PREPROCESS DATA BY TEXT SIMILARITY

Introduction to Data Preprocessing

In this guide, you will learn how to use two powerful Flookup functions that can make your data cleaning easier, faster and less error-prone: NORMALIZE and FUZZYMATCH.

NORMALIZE improves the quality and consistency of your data by removing or formatting text entries that might interfere with the fuzzy matching process.

FUZZYMATCH helps you understand your data better by showing you how similar your text entries are. It also gives you a glimpse of the underlying mechanism that drives the other Flookup functions.
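To make the idea of "how similar two text entries are" concrete, here is a minimal Python sketch. It is not Flookup's algorithm, which this guide does not describe; it assumes a generic sequence-matching ratio from Python's standard library, and the helper name similarity is hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a score from 0.0 (nothing in common) to 1.0 (identical).

    Assumption: a generic matching ratio, used only to illustrate the
    concept of a similarity score. Comparison ignores letter case.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Sample entries are illustrative, not taken from any real dataset.
pairs = [("Acme Corp.", "ACME Corp"), ("Acme Corp.", "Zenith Ltd")]
for left, right in pairs:
    print(f"{left!r} vs {right!r}: {similarity(left, right):.2f}")
```

A score close to 1.0 suggests the two entries probably refer to the same thing, while a low score suggests they do not.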

NORMALIZE

This function can either modify the original dataset in place or leave it unchanged. Here is a condensed look at what each mode does:


-----

To normalize text entries by removing punctuation marks or removing unwanted words, follow the steps below:
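Separately from the add-on's own steps, the effect of this kind of normalization can be sketched in Python. This is an illustration under assumptions, not Flookup's implementation: the function names, the ASCII-only punctuation set and the sample word list are all hypothetical.

```python
import string


def strip_punctuation(text: str) -> str:
    """Replace ASCII punctuation with spaces, then collapse extra whitespace."""
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return " ".join(text.translate(table).split())


def remove_words(text: str, unwanted: set[str]) -> str:
    """Drop whole words found in `unwanted`, ignoring letter case."""
    lowered = {word.lower() for word in unwanted}
    return " ".join(w for w in text.split() if w.lower() not in lowered)


# Hypothetical list of words that add noise when matching company names.
UNWANTED = {"ltd", "inc", "the"}

cleaned = remove_words(strip_punctuation("The Acme Co., Ltd."), UNWANTED)
print(cleaned)  # Acme Co
```

Removing punctuation before removing words matters here: "Ltd." only matches the unwanted word "ltd" once its trailing period is gone.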


-----

To normalize text entries by removing diacritical marks, keeping the URL domain or keeping the URL path, follow the steps below:
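Separately from the add-on's own steps, these three transformations can be sketched with Python's standard library. This is a hedged illustration of the concepts only: the function names are hypothetical, and Flookup's exact handling of edge cases is not specified in this guide.

```python
import unicodedata
from urllib.parse import urlparse


def strip_diacritics(text: str) -> str:
    """Remove accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def url_domain(url: str) -> str:
    """Keep only the host part of a URL (requires a scheme such as https://)."""
    return urlparse(url).netloc


def url_path(url: str) -> str:
    """Keep only the path part of a URL, without query string or fragment."""
    return urlparse(url).path


# example.com is a reserved documentation domain, used here as sample data.
sample = "https://www.example.com/docs/guide?x=1"
print(strip_diacritics("Café Zoë"))  # Cafe Zoe
print(url_domain(sample))            # www.example.com
print(url_path(sample))              # /docs/guide
```

Reducing URLs to a domain or a path before comparison means that entries differing only in protocol, query string or tracking parameters can be treated as the same value.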


-----

Notes On Normalizing Data


-----

NORMALIZE Custom Function

FUZZYMATCH


-----

Notes On Comparing Text for Similarity


-----

FUZZYMATCH Custom Function

For the Visual Learners