ETL, What is it?

ETL, What is it?
Image by Danni Liu, adapted from attjeacock, Darkdiamond67, and Anupong/ Canva

My original plan for the week was to blog about cleaning and transforming data using Microsoft Power Query, but I decided to defer it just a tad. I want to cover the topic of ETL first because Power Query is a lightweight ETL.

As you deepen your experience in analytics, you’ll come across the acronym ETL, which stands for Extract, Transform and Load. Its origin is around data warehousing. When I first came across the term without any deep understanding of it, I remember thinking it sounded very technical and must be for super technical people like data engineers. I don’t need to be concerned with it at all.

It was only when someone outlined the process of ETL to me that it dawned on me that I was performing ETL using Power Query. So, even if you’re not planning to head down the career path of data engineers and intend to be an analyst, it is still helpful for you to understand what it is.
So, in this blog, I would like to provide an overview of ETL by covering the following questions:

  • What are ETL and ETL tools
  • Why is ETL important
  • What is the process of ETL

What are ETL and ETL Tools?

As mentioned earlier, ETL has roots in data warehousing. It refers to the process by which data is extracted from disparate data sources that are not optimized for analytics and moved to a central location so that it is fit for analysis and business intelligence task.

ETL tools are software designed to support ETL processes. It is used to extract data from disparate sources, clean it for consistency and quality, and consolidate it into a central location, typically a data warehouse for analysis and other uses. Some examples of ETL tools are: AWS Glue, Stich, Tableau Prep, Talend, Trifacta, and Matillion.

Here is an image of Tableau Prep:
Blog-18---Tableau-Prep

Why is ETL Important?

Many would have heard that data is the new crude oil. It is the foundation of the modern business world. However, data on its own isn’t useful. It’s through business intelligence that we can derive value from it. Data is rarely structured in a form suited for analysis. Often, we spend a lot of effort pulling disparate data sources and shaping and transforming the data into a format suitable for analysis. The common statistic quoted is up to 80% of the time spent on data preparation. ETL tool is essential because it reduces the time spent in data preparation. That means faster access to information and insights for decision-making.

What is the process of ETL?

ETL process can be categorized into the following stages:

  1. Data Extraction
  2. Data Transformation
  3. Data Load
ETL Process, Image from Stitch Data

1. Data Extraction

Most businesses have multiple data streams from different sources and formats. Data can be structured like your databases and unstructured like documents, emails, and images. Extraction is how we pull all these disparate data together and hold them in a staging area (temporary storage) so that we can apply the subsequent steps of scrubbing and transforming the data into a usable format.

2. Data Transformation

Following the collection of the data, it needs to be processed. As data comes from various systems and in various formats, we need to manipulate it and convert it into a standard format. This stage involves tasks like:

  • Filtering- keeping only specific attributes of data
  • Cleaning-dealing with null values or missing values, removal of major errors and inconsistencies
  • Joining- combining data from multiple columns into one
  • Splitting- splitting a single column into many columns
  • Pivoting/ unpivoting- rotate data from columns to row and vice versa

Data Load

The last step is loading the data to the end target location, which may be a simple delimited flat file or data warehouse. There are two common methods for loading data: full loading and incremental loading

Full loading is where every single datapoint is populated with content. i.e. erasing the contents and reloading with fresh data.

**Incremental loading ** is where incoming data is compared to existing data, and unique information is added as new records.

That’s a wrap. I hope you now understand what ETL is, why it’s important, and the three-stage process of ETL. There are many open-source ETL tools that you can play around Check out this article for a list of open-source ETL tools. I don’t have any experience with any of the listed tools. I’ve only used Tableau Prep, Trifacta and lightweight ETL Power Query. They all have a very friendly user interface. I love ETL tools. One of the key reasons why I love it so much is the time-saving factor.