What is an ETL Pipeline?

ETL is a foundational concept in data engineering. It stands for Extract, Transform, and Load, and it describes the process of moving data from one system to another.

  1. Extract: This is the first step, where you pull raw data from its source. The source could be anything: a production database (like PostgreSQL), a third-party API (like the Stripe API), a set of log files, or CSVs in a storage bucket.
  2. Transform: Raw data is rarely in the perfect format for analysis. The transform step is where you clean it up. This can involve:
      • Cleaning: Handling missing values or correcting data types.
      • Enriching: Joining the data with another dataset to add more context.
      • Aggregating: Grouping data and calculating sums, averages, or counts.
  3. Load: In the final step, you load the newly transformed data into its destination. This is often a data warehouse (like BigQuery or Snowflake) or a data lake, where it can be used for analytics, reporting, and machine learning.
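The three steps can be sketched in plain Python. This is a minimal illustration, not a real pipeline: the inline CSV data, the `customer_totals` table, and the per-customer aggregation are all invented for the example, and an in-memory SQLite database stands in for a warehouse.

```python
import csv
import io
import sqlite3

# --- Extract: pull raw rows from a source (here, an in-memory CSV) ---
RAW_CSV = """order_id,customer,amount
1,alice,19.99
2,bob,
3,alice,5.00
"""

def extract(raw):
    return list(csv.DictReader(io.StringIO(raw)))

# --- Transform: clean missing values, fix types, aggregate per customer ---
def transform(rows):
    totals = {}
    for row in rows:
        amount = float(row["amount"]) if row["amount"] else 0.0  # cleaning
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + amount
    return totals

# --- Load: write the result into a destination (here, SQLite) ---
def load(totals, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS customer_totals (customer TEXT, total REAL)")
    conn.executemany("INSERT INTO customer_totals VALUES (?, ?)", totals.items())
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
```

In a real pipeline each step would talk to an external system, but the shape is the same: extract produces raw records, transform cleans and aggregates them, and load writes them to the destination.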

Why You Need an Orchestrator

You could write a simple Python script to perform an ETL job and run it on a schedule using cron. But what happens when things get complex?

  • What if Task B can only run after Task A succeeds? (Dependencies)
  • What if a temporary network issue causes a task to fail? (Retries)
  • How do you get alerted if the entire pipeline fails? (Monitoring & Alerting)
  • How can you see the status of all your pipelines in one place? (Centralized UI)
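To make those pain points concrete, here is a sketch of what handling just ordering and retries by hand might look like in a plain Python script. The task functions are invented placeholders; real code would also need alerting, logging, and a way to see past runs.

```python
import time

def run_with_retries(task, retries=3, delay=1.0):
    """Re-run a flaky task a few times before giving up."""
    for attempt in range(1, retries + 1):
        try:
            return task()
        except Exception as exc:
            print(f"attempt {attempt}/{retries} failed: {exc}")
            if attempt == retries:
                raise  # out of retries: surface the failure
            time.sleep(delay)

def extract():
    return [1, 2, 3]

def transform(data):
    return [x * 10 for x in data]

def load(data):
    print(f"loaded {data}")

# Dependencies enforced by hand: each step runs only if the previous one succeeded,
# because an exception in run_with_retries stops the script.
data = run_with_retries(extract)
data = run_with_retries(lambda: transform(data))
run_with_retries(lambda: load(data))
```

Every new pipeline would duplicate this scaffolding, and nothing here gives you a central view of what ran, what failed, or why.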

This is where a workflow orchestrator like Apache Airflow comes in.

Introducing Apache Airflow

Airflow is an open-source platform to programmatically author, schedule, and monitor workflows. You don't drag-and-drop boxes; you define your pipelines in Python code.

Core Concepts

  • DAG (Directed Acyclic Graph): The core of Airflow. A DAG is a Python script that defines your pipeline. It's a collection of all the tasks you want to run, organized in a way that shows their relationships and dependencies. It's "Directed" because tasks flow in one direction, and "Acyclic" because it cannot have loops.
  • Operator: An Operator defines a single task in a DAG. Airflow has many pre-built operators:
      • BashOperator: Executes a bash command.
      • PythonOperator: Calls a Python function.
      • PostgresOperator: Executes a SQL query against a PostgreSQL database.
      • ...and many more, shipped in provider packages for popular services.
  • Task: A task is an instantiated operator within a DAG; each scheduled run of a task is called a task instance.

Building a Simple DAG

Let's create a basic DAG with three tasks: extract, transform, and load. Each will be a simple BashOperator that just prints a message.

dags/my_first_dag.py

from __future__ import annotations

import pendulum

from airflow.models.dag import DAG
from airflow.operators.bash import BashOperator

# Define the DAG
with DAG(
    dag_id="my_first_dag",
    start_date=pendulum.datetime(2025, 1, 1, tz="UTC"),
    schedule="@daily", # Run this pipeline once per day
    catchup=False,
    tags=["example"],
) as dag:
    # Define the first task (Extract)
    extract_task = BashOperator(
        task_id="extract",
        bash_command="echo 'Extracting data...'",
    )

    # Define the second task (Transform)
    transform_task = BashOperator(
        task_id="transform",
        bash_command="echo 'Transforming data...'",
    )

    # Define the third task (Load)
    load_task = BashOperator(
        task_id="load",
        bash_command="echo 'Loading data...'",
    )

    # Set the task dependencies
    # This means: extract_task must complete successfully before transform_task can start, and so on.
    extract_task >> transform_task >> load_task

When you place this file in your Airflow dags folder, the scheduler will parse it, and the UI will show a visual representation of your pipeline, ready to be scheduled.
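As an aside, the `extract_task >> transform_task >> load_task` line works because Airflow operators overload Python's right-shift operator to declare dependencies. Here is a toy sketch of the mechanism (this is an illustration of the idea, not Airflow's actual implementation):

```python
class Task:
    """A toy stand-in for an Airflow operator."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # a >> b records that b runs after a, and returns b so chains work
        self.downstream.append(other)
        return other

extract = Task("extract")
transform = Task("transform")
load = Task("load")

# Because __rshift__ returns its right-hand side, this chains left to right.
extract >> transform >> load
```

After the chain runs, `extract` knows that `transform` is downstream of it, and `transform` knows the same about `load`, which is exactly the dependency graph Airflow builds from your DAG file.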