REMOVING DUPLICATES BY TEXT SIMILARITY
Introduction to Removing Duplicates
To remove duplicates from a single column using Flookup, go to Extensions > Flookup Data Wrangler > Transformation functions > Remove duplicates in your spreadsheet menu.
Removing Duplicates by Percentage or Sound Similarity
- Select the function to run
  Click the menu item labelled By percentage or By sound, depending on your needs.
- Select the mode to run
  Choose from the first drop-down menu:
  - Keep first unique value
  - Keep last unique value
- Select the text entries to analyse
  Select one or more columns. If you select a range (e.g. A2:D500) and duplicates are identified on a row, all columns in that row will be removed.
- Index the selected data
  Click Map columns in selection to index your columns.
- Specify the column of data to analyse
  Enter the Left_column index. If left blank, the first column is analysed.
- Enter the level of similarity
  (Only for "By percentage") Enter the Threshold value. Higher values mean only close matches are considered duplicates; lower values are more permissive.
- Remove duplicates
  Click the Remove duplicates button.
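To make the effect of the mode and Threshold settings concrete, here is a minimal Python sketch of single-column removal by percentage. It is not Flookup's implementation: the similarity score comes from Python's difflib as a stand-in, the 0 to 1 threshold scale is an assumption, and the function and parameter names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity score (a stand-in, not Flookup's own metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def remove_duplicates(rows, left_column=0, threshold=0.8, keep="first"):
    """Drop whole rows whose left_column value is similar to an already kept row."""
    # For "last", scan from the bottom so the final occurrence is the one kept.
    ordered = rows if keep == "first" else rows[::-1]
    kept = []
    for row in ordered:
        if all(similarity(row[left_column], k[left_column]) < threshold for k in kept):
            kept.append(row)
    return kept if keep == "first" else kept[::-1]


rows = [
    ["Acme Ltd", "London"],
    ["Acme Ltd.", "Leeds"],
    ["Globex", "Paris"],
]
first = remove_duplicates(rows, keep="first")
last = remove_duplicates(rows, keep="last")
print(first)
print(last)
```

With keep="first" the "Acme Ltd" row survives and "Acme Ltd." is dropped; with keep="last" it is the other way round. In both cases the whole row goes, not just the analysed cell.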
How to Remove Duplicates Across Two Different Columns
- Select the function to run
  Click By percentage or By sound.
- Select the mode to run
  Choose from the first drop-down menu:
  - Keep first unique value
  - Keep last unique value
- Select the comparison mode
  Select Compare two different columns from the second drop-down.
- Select the data to compare
  Select text entries of two or more columns. This determines the number of columns deleted for each duplicate row.
- Index the selected data
  Click Map columns in selection.
- Specify the column indexes to analyse
  Enter your Left_column and Right_column index.
- Set the level of similarity
  (Only for "By percentage") Adjust the Threshold value as needed.
- Remove duplicates
  Click Remove duplicates.
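The two-column comparison can be sketched as follows. This is an illustration only: it assumes a row is deleted when its Left_column value is similar to any value in Right_column, uses difflib as a stand-in for the real similarity measure, and all names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity score (a stand-in, not Flookup's own metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def remove_cross_column_duplicates(rows, left_column, right_column, threshold=0.8):
    """Drop rows whose left_column value is similar to any right_column value."""
    # Blank cells are skipped so they never count as a match.
    right_values = [row[right_column] for row in rows if row[right_column]]
    return [
        row
        for row in rows
        if not any(similarity(row[left_column], r) >= threshold for r in right_values)
    ]


rows = [
    ["Jon Smith", "Mary Jones"],
    ["Ann Lee", "John Smith"],
    ["Raj Patel", ""],
]
result = remove_cross_column_duplicates(rows, left_column=0, right_column=1)
print(result)
```

Here "Jon Smith" in the left column is close enough to "John Smith" in the right column, so the first row is removed and the other two remain.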
How to Remove Duplicate Rows
- Select the function to run
  Click By percentage or By sound.
- Select the mode to run
  Choose from the first drop-down menu:
  - Keep first identified duplicate value
  - Keep last identified duplicate value
- Select the comparison mode
  Select Compare data in selection by row from the second drop-down.
- Select the data to compare
  Select a data range of two or more columns to be analysed for duplicates.
- Index the selected data
  Click Map columns in selection.
- Set the level of similarity
  (Only for "By percentage") Adjust the Threshold value as needed.
- Remove duplicates
  Click Remove duplicates.
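As a rough picture of row-by-row comparison, the sketch below scores two rows by joining their cells into a single string and comparing the results. How Flookup actually combines the columns is not described here, so the joining step, the 0 to 1 threshold scale, and the function names are all assumptions.

```python
from difflib import SequenceMatcher


def row_similarity(row_a, row_b) -> float:
    """Score two rows by comparing their cells joined into one string (an assumption)."""
    a = " ".join(row_a).lower()
    b = " ".join(row_b).lower()
    return SequenceMatcher(None, a, b).ratio()


def remove_duplicate_rows(rows, threshold=0.9, keep="first"):
    """Keep one row from each group of similar rows and drop the rest."""
    ordered = rows if keep == "first" else rows[::-1]
    kept = []
    for row in ordered:
        if all(row_similarity(row, k) < threshold for k in kept):
            kept.append(row)
    return kept if keep == "first" else kept[::-1]


rows = [
    ["Acme Ltd", "12 High Street", "London"],
    ["Acme Ltd", "12 High St", "London"],
    ["Acme Ltd", "40 Park Lane", "Leeds"],
]
deduped = remove_duplicate_rows(rows, keep="first")
print(deduped)
```

The first two rows differ only in "Street" versus "St" and are treated as one record, while the third row shares a company name but not an address and is kept.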
How to Remove Duplicates of Data in a Single Cell
- Click By percentage or By sound.
- Select Remove duplicates by cell value from the second drop-down.
- Click a single cell containing the content whose duplicates you wish to remove and click Grab selected cell.
- Select the data range to be analysed and click Map columns in selection.
- Change the Left_column value to specify the column index to remove duplicates from.
- (Only for "By percentage") Adjust the Threshold value as needed.
- Click Remove duplicates.
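The idea behind the cell-value mode can be illustrated by checking which rows count as duplicates of one chosen value. This sketch only identifies the matching rows; it does not model what is deleted or kept afterwards. The difflib score and the function name are assumptions, not part of Flookup.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity score (a stand-in, not Flookup's own metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rows_matching_cell(target, rows, left_column=0, threshold=0.8):
    """Return the indexes of rows whose left_column value is similar to target."""
    return [
        i
        for i, row in enumerate(rows)
        if similarity(target, row[left_column]) >= threshold
    ]


rows = [
    ["Acme Ltd", "London"],
    ["ACME Ltd.", "Leeds"],
    ["Globex", "Paris"],
]
matches = rows_matching_cell("Acme Ltd", rows)
print(matches)
```

Only rows resembling the grabbed cell are flagged; other near-duplicate groups in the same column are left alone.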
How to Roll Up Data from Duplicate Rows
- Click By percentage.
- Select Roll up data in selection by row from the second drop-down.
- Select the data range of two or more columns to be analysed for duplicates.
- Click Map columns in selection.
- (Only for "By percentage") Adjust the Threshold value as needed.
- Click Remove duplicates.
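One way to picture a roll-up is to merge each group of similar rows into a single row. The sketch below is a guess at the general idea, not Flookup's documented behaviour: it assumes rows are grouped by the first column, and that the remaining cells are combined by joining their distinct values with a comma. All names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity score (a stand-in, not Flookup's own metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def roll_up(rows, key_column=0, threshold=0.8, sep=", "):
    """Merge rows with similar key_column values into one row per group."""
    groups = []  # each group is a list of rows judged to be duplicates
    for row in rows:
        for group in groups:
            if similarity(row[key_column], group[0][key_column]) >= threshold:
                group.append(row)
                break
        else:
            groups.append([row])

    rolled = []
    for group in groups:
        merged = []
        for col in range(len(group[0])):
            if col == key_column:
                merged.append(group[0][col])  # keep the first spelling of the key
            else:
                values = []
                for r in group:
                    if r[col] not in values:  # preserve order, skip repeats
                        values.append(r[col])
                merged.append(sep.join(values))
        rolled.append(merged)
    return rolled


rows = [
    ["Acme Ltd", "London"],
    ["Acme Ltd.", "Leeds"],
    ["Globex", "Paris"],
]
rolled = roll_up(rows)
print(rolled)
```

Unlike plain removal, nothing from the duplicate row is lost: the two Acme rows collapse into one row that carries both cities.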
How to Extract Unique Values
- Select the function to run
  Click By percentage or By sound.
- Select the mode to run
  Choose from the first drop-down menu:
  - Keep first identified duplicate value
  - Keep last identified duplicate value
- Select the comparison mode
  Select Compare data in selection by row from the second drop-down.
- Select the data to compare
  Select a data range of two or more columns to be analysed for duplicates.
- Index the selected data
  Click Map columns in selection.
- Set the level of similarity
  (Only for "By percentage") Adjust the Threshold value as needed. Higher values are stricter, lower values are more permissive.
- Remove duplicates
  Click Remove duplicates.
Notes on Removing Duplicates
- In single-column mode, only the column specified by Left_column is analysed.
- For two-column mode, remove duplicates within the Left_column first for best results.
- In two-column mode, a duplicate is a value in Left_column that also exists in Right_column; any row containing a duplicate is deleted.
- For row comparison, all columns in a duplicate row are deleted.
- Threshold controls how strict the function is: higher = stricter, lower = more permissive.
- After running, a message will indicate how many rows were processed.
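The note on Threshold can be seen in a small experiment: the same pair of values is judged a duplicate or not depending on how strict the setting is. The score below comes from Python's difflib and the 0 to 1 scale is an assumption; Flookup's own scoring may differ.

```python
from difflib import SequenceMatcher

left, right = "Jon Smith", "John Smith"

# One fixed score for the pair; only the cut-off changes below.
score = SequenceMatcher(None, left.lower(), right.lower()).ratio()

verdicts = {}
for threshold in (0.5, 0.9, 0.99):
    verdicts[threshold] = "duplicate" if score >= threshold else "distinct"
    print(f"threshold={threshold}: {verdicts[threshold]}")
```

At the two lower thresholds the pair is treated as a duplicate; at 0.99 only near-identical text would qualify, so the pair is kept as two distinct values.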