Messy data holds you back from contributing meaningfully to Wikibase. Whether it is typos, duplicates or inconsistent formats, getting your dataset ready for projects like Wikidata can feel like a chore.
That is where Flookup comes in: a free Google Sheets add-on that simplifies preparing and cleaning data with the help of advanced fuzzy matching algorithms. It builds on the features Google Sheets already provides by default and extends them with fuzzy matching and AI-assisted cleaning.
In this tutorial, we will take a list of museums and guide you step by step through cleaning it and formatting it to align with Wikidata’s structured requirements, so you can confidently apply the same process to your own datasets.
Before we proceed, you will need to have Flookup installed. Please refer to the installation guide if you have not installed it yet.
Otherwise, let us work with a sample dataset, a list of museums destined for Wikibase. Here is what it might look like:
What is Wrong Here?
Duplicates: “British Museum” and “Britsh Museum” are the same, just with a typo.
Inconsistent Cities: “London” vs. “Lndon”, “Paris” vs. “Paris, France”.
Naming Variations: “Louvre Museum” and “Musée du Louvre” refer to the same place.
Duplicates can clutter your Wikibase contribution, so let’s use Flookup fuzzy matching to spot and fix them.
Select Your Column: Highlight the “Museum Name” column (A2:A7).
Launch the highlight function: Go to Extensions > Flookup Data Wrangler > Highlight duplicates.
Set the similarity threshold to 0.8 because this catches close matches without being too strict.
Click "Highlight" to execute the function. This will highlight all duplicates in the selection.
What You Will See:
“British Museum” and “Britsh Museum” flagged as a match due to their high similarity score.
“Louvre Museum” and “Musée du Louvre” might also pair up, depending on the threshold.
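To build intuition for what a similarity threshold does, here is a minimal Python sketch. It does not reproduce Flookup's own algorithm, which is not documented here; it uses the standard library's `difflib.SequenceMatcher` as a stand-in, and the helper names `similarity` and `find_similar_pairs` are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score for two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_similar_pairs(names, threshold=0.8):
    """Return (name_a, name_b, score) for every pair scoring at or above threshold."""
    pairs = []
    for a, b in combinations(names, 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs


# Museum names taken from the examples in this tutorial.
names = ["British Museum", "Britsh Museum", "Louvre Museum", "Musée du Louvre"]
for a, b, score in find_similar_pairs(names, threshold=0.8):
    print(f"{a!r} ~ {b!r}: {score}")
```

With this character-based stand-in, only the typo pair clears 0.8. “Louvre Museum” and “Musée du Louvre” share words but in a different order, so they score far lower. Naming variations like these may need a lower threshold or a manual review, and Flookup's matcher may score them differently.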
Clean Up the Duplicates:
Review Matches: Check Flookup’s output to confirm the pairs make sense.
Merge or Delete: Delete the duplicate row or use the Merge Data feature to combine useful info from both into one row.
Select Your Column: Highlight the “Museum Name” column once again.
Launch the remove function: Go to Extensions > Flookup Data Wrangler > Remove duplicates.
Set the similarity threshold to 0.8 because this is the value we used to identify the duplicates in the first place.
Click "Remove duplicates" to execute the function. This will remove all identified duplicates.
PRO TIP: If fuzzy matching misses something, lower the threshold to 0.7, but double-check for false positives.
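The removal step can be sketched in the same spirit. This is an illustration under assumptions, not Flookup's implementation: it keeps the first row of each group of near-duplicates and drops later rows whose name scores at or above the threshold. The function names and the sample rows (including which city goes with which museum) are made up for the example.

```python
from difflib import SequenceMatcher


def is_near_duplicate(a: str, b: str, threshold: float) -> bool:
    """True when two strings score at or above the threshold (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def drop_near_duplicates(rows, key, threshold=0.8):
    """Keep the first row of each near-duplicate group, comparing on rows[key]."""
    kept = []
    for row in rows:
        if not any(is_near_duplicate(row[key], k[key], threshold) for k in kept):
            kept.append(row)
    return kept


# Illustrative rows only; the pairing of names and cities is assumed.
rows = [
    {"Museum Name": "British Museum", "City": "London"},
    {"Museum Name": "Britsh Museum", "City": "Lndon"},
    {"Museum Name": "Louvre Museum", "City": "Paris"},
]
cleaned = drop_near_duplicates(rows, key="Museum Name", threshold=0.8)
for row in cleaned:
    print(row["Museum Name"])
```

Keeping the first occurrence is a design choice of this sketch. If the later row holds the better data, reorder the rows first or merge the two rows before removing anything.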
Wikibase thrives on consistency, so let us ensure that the city names and formats meet the desired standard by using Flookup's AI-powered feature.
Overview of How to Do It
Select Your Column: Highlight the “City” column in the Google Sheet.
Launch the AI sidebar: Head to Extensions > Flookup Data Wrangler > Intelligent data cleaning.
Set the Mode: From the sidebar, pick "STANDARDIZE DATA" to inject consistency into your dataset.
Give the AI instructions: In the prompt box, type a clear, concise command like: “Standardise city names to lowercase, remove commas and country names.”
Run the function: Click "Submit Prompt" and let the AI do its magic. It will work through your selected range and return clean results in a location that you specify.
Check the output: Review the standardised cities (e.g. “lndon” becomes “london”, “paris france” becomes “paris”).
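If you want to sanity-check the AI output, or reproduce the same clean-up outside the sheet, the rule in the prompt can be approximated with a few lines of Python. This is a rough sketch, not what Flookup runs internally: `KNOWN_CITIES` is a hypothetical reference list you would supply yourself, and typo correction is done with `difflib.get_close_matches` from the standard library.

```python
from difflib import get_close_matches

# Hypothetical reference list of canonical city names; supply your own.
KNOWN_CITIES = ("london", "paris", "new york")


def standardise_city(raw: str, known=KNOWN_CITIES, cutoff: float = 0.8) -> str:
    """Lowercase, drop everything after the first comma, then snap to a known city."""
    city = raw.split(",")[0].strip().lower()
    matches = get_close_matches(city, known, n=1, cutoff=cutoff)
    # Fall back to the cleaned text when nothing in the reference list is close enough.
    return matches[0] if matches else city


for raw in ["London", "Lndon", "Paris", "Paris, France"]:
    print(f"{raw!r} -> {standardise_city(raw)!r}")
```

Dropping everything after the first comma assumes the country always follows the city, as in “Paris, France”. Values in another layout would need a different rule.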
You can use this AI feature to standardise your datasets with a multitude of other operations.
For example, for dates, highlight the column, select "TRANSFORM DATA", and prompt: "Convert years to YYYY-MM-DD format, assume January 1st" to reformat years like "1753" to "1753-01-01", in keeping with the Wikibase preference.
PRO TIP: Keep prompts short and sharp (e.g. “Remove punctuation from cities”). If the AI misses the mark, tweak the wording and try again.
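The year-to-date conversion described above is simple enough to verify by hand. A minimal Python sketch follows, assuming every input is a bare year; the function name `year_to_iso_date` is hypothetical.

```python
from datetime import date


def year_to_iso_date(year_text: str) -> str:
    """Turn a bare year such as "1753" into "1753-01-01" (January 1st assumed)."""
    year = int(year_text.strip())  # raises ValueError for non-numeric input
    return date(year, 1, 1).isoformat()


print(year_to_iso_date("1753"))
```

Anything that is not a plain year raises a `ValueError` in this sketch. That makes it easy to spot cells the AI, or you, should look at again.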
Your data is almost ready. Let us make sure it is perfect.
Final Checklist
Duplicates are gone (e.g. one “British Museum”).
Cities are consistent (e.g. “london”, “paris”, “new york”).
Dates are formatted (e.g. “1753-01-01”).
No missing critical information.
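The checklist above can also be run as an automated check before export. The sketch below is one possible way to do it in Python. The column name "Founded" is an assumption, since the tutorial does not name the date column, and the duplicate check only catches exact matches after lowercasing, not fuzzy ones.

```python
import re
from collections import Counter

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def cell(row, col):
    """Return the trimmed cell text, treating a missing column as empty."""
    return (row.get(col) or "").strip()


def check_rows(rows, name_col="Museum Name", city_col="City", date_col="Founded"):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    counts = Counter(cell(row, name_col).lower() for row in rows)
    for name, n in counts.items():
        if name and n > 1:
            problems.append(f"duplicate name: {name!r} appears {n} times")
    for i, row in enumerate(rows, start=2):  # row 1 is the header in the sheet
        for col in (name_col, city_col, date_col):
            if not cell(row, col):
                problems.append(f"row {i}: missing {col}")
        city = cell(row, city_col)
        if city != city.lower() or "," in city:
            problems.append(f"row {i}: city not standardised: {city!r}")
        founded = cell(row, date_col)
        if founded and not ISO_DATE.match(founded):
            problems.append(f"row {i}: date not YYYY-MM-DD: {founded!r}")
    return problems


# Illustrative rows; the "Founded" column name is assumed.
rows = [
    {"Museum Name": "British Museum", "City": "london", "Founded": "1753-01-01"},
    {"Museum Name": "Louvre Museum", "City": "Paris, France", "Founded": ""},
]
for problem in check_rows(rows):
    print(problem)
```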
Export for Wikibase
Go to File > Download > Comma-separated values (.csv).
Save your clean dataset. It is now ready for tools like QuickStatements or manual Wikibase entry.
QUICK TIP: Test a small batch in Wikibase first to catch any formatting quirks.
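One way to prepare such a test batch is to cut the first few rows out of the exported CSV. The sketch below uses Python's standard `csv` module; the function name `first_batch` and the sample text are illustrative, and how you then load the batch into Wikibase is up to the tool you use.

```python
import csv
import io


def first_batch(csv_text: str, size: int = 5) -> str:
    """Return CSV text holding the header plus the first `size` data rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for i, row in enumerate(reader):
        if i >= size:
            break
        writer.writerow(row)
    return out.getvalue()


# Illustrative export; in practice, read the text from your downloaded file.
sample = "Museum Name,City\nBritish Museum,london\nLouvre Museum,paris\n"
print(first_batch(sample, size=1))
```

To use it on a real export, read the downloaded file into a string first, for example with `open(path, newline="", encoding="utf-8").read()`, where `path` points to your own file.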
Flookup streamlines data preparation with efficiency and ease. Designed to integrate seamlessly with Google Sheets, it offers an intuitive interface that requires minimal training to master.
Having followed this tutorial, you have deduplicated, standardised, and refined your dataset, so it is ready for integration into Wikibase.