DATA CLEANING IN GOOGLE SHEETS USING GPT MODELS
Data Cleaning in the Era of AI
Data cleaning is a crucial step in any data analysis process. It involves removing duplicates, correcting errors and standardizing data formats. However, this process can be time-consuming and tedious. But what if you could automate it using artificial intelligence? Enter GPTs, a family of powerful AI models developed by OpenAI and the technology behind ChatGPT, which can be used for data cleaning in Google Sheets.
What is a GPT Model?
A Generative Pretrained Transformer (GPT) is a type of Large Language Model (LLM) and a prominent framework for generative artificial intelligence. It is capable of understanding and generating human-like responses based on the input it receives.
LLMs sit within the field of Natural Language Processing and the study of semantics, whose roots can be traced back to the pioneering work of the French philologist Michel Bréal, who introduced the concept of semantics in 1883. The evolution of LLMs took a significant leap forward in 2017, when Google researchers unveiled the transformer architecture at the NeurIPS conference; it has since become a fundamental component of modern LLMs.
At its core, a GPT model leverages machine learning techniques to process and generate text based on the input it receives. It is trained on a diverse range of internet text, enabling it to respond to a wide array of prompts in a manner that is coherent and contextually relevant. This makes it an excellent tool for automating various tasks, including data cleaning. Today, the most popular GPT-based application is ChatGPT, which was developed by OpenAI and publicly released on 30 November 2022.
How Can a GPT Model Help with Data Cleaning?
Removing duplicates: It can identify and remove duplicate text entries in your data, ensuring that each entry is truly unique.
Standardizing formats: It can help standardize data formats across your spreadsheet. This means it can make sure that all your data follows the same desired format, making it easier to read and analyse.
Filling in missing values: It can infer plausible values from the surrounding context, or suggest interpolation methods, to fill in missing values in your dataset.
Error correction: It can identify errors in your data and suggest corrective actions. This helps improve the accuracy and reliability of your data.
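As an illustration, each of the tasks above can be expressed as a plain-language instruction prepended to your cell values before they are sent to the model. The templates below are hypothetical examples of such prompts, not a fixed API:

```javascript
// Hypothetical prompt templates for common data cleaning tasks.
// Each template is prepended to the raw cell values before the API call.
const CLEANING_PROMPTS = {
  removeDuplicates:
    'Remove duplicate entries from this list and return one value per line:',
  standardizeDates:
    'Rewrite every date below in YYYY-MM-DD format, one per line:',
  fillMissing:
    'Fill in any blank entries below with the most plausible value, one per line:',
  correctErrors:
    'Fix obvious spelling and formatting errors in the entries below, one per line:'
};

// Combine a template with the data to produce the final prompt.
function buildCleaningPrompt(task, values) {
  return CLEANING_PROMPTS[task] + '\n' + values.join('\n');
}
```

The same pattern extends to any other cleaning task: write the instruction once, then reuse it across whole columns of data.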
Merits of using a GPT Model for Data Cleaning
Efficient automation: By leveraging the capabilities of a GPT model, you can automate much of the data cleaning process. This not only saves you significant time and effort but also allows you to focus on more complex tasks that require your expertise.
High levels of accuracy: A GPT model is designed to ensure high levels of accuracy in your data. By identifying and correcting errors, it reduces the risk of inaccuracies in your results, leading to more reliable insights and decision-making.
Impressive scalability: One of the key strengths of a GPT model is its ability to handle large datasets. Manual data cleaning can be time-consuming and prone to errors, especially with large volumes of data. A GPT, on the other hand, can process and clean these datasets efficiently, making it a scalable solution for your data cleaning needs.
Extensive customizability: A GPT model does not stick to a one-size-fits-all solution. You can customize the prompts to suit your specific data cleaning needs. Whether you need to standardize formats, fill in missing values or remove duplicates, you can tailor a GPT model's operations to meet your requirements.
Ease of use: You do not need to be a data cleaning expert to clean data with a GPT model. Thanks to its natural language processing capabilities, you can give instructions in plain English or another natural language.
Demerits of using a GPT Model for Data Cleaning
Inherent complexity: The process of writing and debugging code generated by a GPT model can be challenging, particularly for those who do not have a background in programming. This complexity presents a steep learning curve and may necessitate additional time and resources to master.
Dependence on an external API: This approach heavily relies on the availability and performance of the API. Any congestion, downtime or performance issues with the API could disrupt your data cleaning process, potentially leading to delays and inefficiencies.
Potential for operational costs: Using the API is not free, and extensive use could lead to high operational costs. It is important to consider these costs when planning your data cleaning strategy, especially for large-scale projects.
Data privacy concerns: Sending sensitive data to an external API could pose data privacy concerns. It is crucial to ensure that any sensitive data is properly anonymized or encrypted before sending it to the API to protect your data and comply with privacy regulations.
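On the privacy point above, one simple precaution is to mask obviously sensitive fields before a value ever leaves the sheet. Here is a minimal sketch; the masking rules are illustrative assumptions, not a complete anonymization scheme:

```javascript
// Mask email addresses and long digit runs (e.g. phone numbers or IDs)
// in a cell value before sending it to an external API.
function maskSensitiveText(text) {
  return text
    // Replace anything that looks like an email address.
    .replace(/[\w.+-]+@[\w-]+(\.[\w-]+)+/g, '[EMAIL]')
    // Replace runs of 7 or more digits (phone numbers, account numbers).
    .replace(/\d{7,}/g, '[NUMBER]');
}
```

Proper anonymization depends on your data and the regulations that apply to it, so treat this as a starting point rather than a guarantee.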
Making API Calls from Google Sheets to a GPT Model
Google Sheets, a versatile tool, is not only useful for data entry and analysis, but also for interacting with APIs, including those of GPT models. Here is an example of what your code might look like:
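The sketch below assumes the OpenAI Chat Completions endpoint and the gpt-3.5-turbo model; adjust the URL, model name and token limit to match your account and needs:

```javascript
// Send a prompt to the OpenAI API and return the model's reply.
// Replace "YOUR_OPENAI_API_KEY" with your actual OpenAI API key.
function callGPTModel(prompt) {
  var options = {
    method: 'post',
    contentType: 'application/json',
    headers: { Authorization: 'Bearer ' + 'YOUR_OPENAI_API_KEY' },
    payload: JSON.stringify(buildGPTPayload(prompt, 150))
  };
  var response = UrlFetchApp.fetch(
    'https://api.openai.com/v1/chat/completions', options);
  var data = JSON.parse(response.getContentText());
  return data.choices[0].message.content.trim();
}

// Build the JSON payload: the prompt and the maximum number of tokens.
function buildGPTPayload(prompt, maxTokens) {
  return {
    model: 'gpt-3.5-turbo',
    messages: [{ role: 'user', content: prompt }],
    max_tokens: maxTokens
  };
}
```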
In this function, we are using the powerful UrlFetchApp service to make the request from inside the Apps Script environment, but you will need to replace "YOUR_OPENAI_API_KEY" with your actual OpenAI API key. This function makes a POST request to the OpenAI API, sending a JSON payload that contains your prompt and the maximum number of tokens to generate.
You can call this function in any cell in your Google Sheet by typing =callGPTModel(text) where "text" can be the actual prompt or a reference to a cell with the prompt.
Potential Applications
Generation of content: You can generate ideas for blog posts, draft emails, create posts for social media and other online content.
Translation: You can translate text from one language to another by including the source text and the target language in your prompt.
Coding assistance: You can receive help with coding problems. Simply input your problem, and the GPT model can provide a solution.
Learning and education: You can use it to learn new topics by asking the model to explain a complex topic in simple terms.
Data Cleaning Using Flookup and GPT Models - Coming Soon
You will soon be able to use Flookup to clean data using GPT-3.5 Turbo or GPT-4 Turbo. It will be preconfigured to handle tasks like removing duplicates, standardizing data and fuzzy matching. To access it, head to Extensions > Flookup Data Wrangler > Process data with OpenAI™.
Why Traditional Data Cleaning Algorithms Still Matter
While AI models like GPT-4 offer innovative ways to automate data cleaning, traditional algorithms written in Google Apps Script retain their own unique advantages:
Google Workspace integration: Google Apps Script is deeply integrated with Google Workspace, making it easy to interact with data in Google Sheets, Docs, Slides and more.
Transparency: Google Apps Script is easy to read and understand, providing full transparency into your data cleaning process.
No external dependencies: With Google Apps Script, all your data and processing stay within the Google ecosystem, eliminating potential issues with availability, performance, and cost associated with external APIs.
Data privacy: Since your data does not leave the Google ecosystem when using Google Apps Script, there are fewer concerns about data privacy.
Easy customisation: Google Apps Script can be easily customized to handle specific data cleaning tasks, providing a level of flexibility that most AI models may not immediately offer.
Final Thoughts
Data cleaning does not have to be a tedious process. By combining the power of GPT models and Google Sheets, you can automate this process, saving you time and ensuring that your data is clean and ready for analysis.
Please remember that the specifics of how you can use a GPT model for data cleaning will depend on your particular use case and dataset. It is always a good idea to experiment with different approaches and see what works best for your needs.