What is an ETL Pipeline?
ETL is a foundational concept in data engineering. It stands for Extract, Transform, and Load, and it describes the process of moving data from one system to another.
- Extract: This is the first step, where you pull raw data from its source. The source could be anything: a production database (like PostgreSQL), a third-party API (like the Stripe API), a set of log files, or CSVs in a storage bucket.
- Transform: Raw data is rarely in the perfect format for analysis. The transform step is where you clean it up. This can involve:
  - Cleaning: Handling missing values or correcting data types.
  - Enriching: Joining the data with another dataset to add more context.
  - Aggregating: Grouping data and calculating sums, averages, or counts.
- Load: In the final step, you load the newly transformed data into its destination. This is often a data warehouse (like BigQuery or Snowflake) or a data lake, where it can be used for analytics, reporting, and machine learning.
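The three steps above can be sketched in plain Python. This is a minimal, hypothetical example: the record data, field names, and function names are made up for illustration, and a real pipeline would read from and write to external systems instead of in-memory lists.

```python
# Minimal ETL sketch. All names and data are hypothetical.

def extract() -> list[dict]:
    # In practice: query a database, call an API, or read files.
    return [
        {"user": "alice", "amount": "10.5"},
        {"user": "bob", "amount": None},  # missing value
        {"user": "alice", "amount": "4.5"},
    ]

def transform(rows: list[dict]) -> dict[str, float]:
    # Clean: drop rows with missing amounts; fix types (str -> float).
    cleaned = [
        {"user": r["user"], "amount": float(r["amount"])}
        for r in rows
        if r["amount"] is not None
    ]
    # Aggregate: total amount per user.
    totals: dict[str, float] = {}
    for r in cleaned:
        totals[r["user"]] = totals.get(r["user"], 0.0) + r["amount"]
    return totals

def load(totals: dict[str, float]) -> None:
    # In practice: write to a warehouse table. Here we just print.
    for user, total in sorted(totals.items()):
        print(f"{user}: {total}")

load(transform(extract()))
```

Even this toy version shows the shape of the work: each step has a clear input and output, which is exactly what makes the steps easy to turn into separate tasks later.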
Why You Need an Orchestrator
You could write a simple Python script to perform an ETL job and run it on a schedule using cron. But what happens when things get complex?
- What if Task B can only run after Task A succeeds? (Dependencies)
- What if a temporary network issue causes a task to fail? (Retries)
- How do you get alerted if the entire pipeline fails? (Monitoring & Alerting)
- How can you see the status of all your pipelines in one place? (Centralized UI)
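To make just one of those concerns concrete, here is the kind of retry logic you would otherwise hand-roll around every task in a plain script. The task function, retry counts, and backoff policy here are all hypothetical; an orchestrator gives you this behavior declaratively, per task.

```python
import time

def run_with_retries(task, retries: int = 3, delay: float = 1.0):
    """Naive retry loop you'd otherwise write around every task."""
    for attempt in range(1, retries + 1):
        try:
            return task()
        except Exception as exc:
            print(f"Attempt {attempt} failed: {exc}")
            if attempt == retries:
                raise  # out of retries: give up and surface the error
            time.sleep(delay * attempt)  # simple linear backoff

# Hypothetical flaky task: fails twice, then succeeds.
calls = {"n": 0}

def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary network issue")
    return "data"

print(run_with_retries(flaky_extract, retries=3, delay=0.01))
```

Now multiply this boilerplate by every task, add dependency checks and alerting, and the appeal of a dedicated tool becomes obvious.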
This is where a workflow orchestrator like Apache Airflow comes in.
Introducing Apache Airflow
Airflow is an open-source platform to programmatically author, schedule, and monitor workflows. You don't drag-and-drop boxes; you define your pipelines in Python code.
Core Concepts
- DAG (Directed Acyclic Graph): The core of Airflow. A DAG is a Python script that defines your pipeline. It's a collection of all the tasks you want to run, organized in a way that shows their relationships and dependencies. It's "Directed" because tasks flow in one direction, and "Acyclic" because it cannot have loops.
- Operator: An Operator defines a single task in a DAG. Airflow ships with many pre-built operators:
  - BashOperator: Executes a bash command.
  - PythonOperator: Calls a Python function.
  - PostgresOperator: Executes a SQL query against a PostgreSQL database.
  - ...and many more, distributed as provider packages, covering most popular services.
- Task: A task is a configured instance of an operator within a DAG; each scheduled run of that task is a separate task instance.
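The "directed acyclic" property is what makes a valid run order computable at all. A minimal sketch of the idea in plain Python, using the standard library's `graphlib` (the task names here are hypothetical):

```python
from graphlib import TopologicalSorter

# A DAG as a dependency map: each task lists the tasks it depends on.
# Task names are hypothetical.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# static_order() yields a valid execution order, and raises
# CycleError if the graph contains a cycle (i.e. is not acyclic).
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'load']
```

This is, in spirit, what a scheduler does with your DAG: walk the dependency graph and only start a task once everything it depends on has succeeded.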
Building a Simple DAG
Let's create a basic DAG with three tasks: extract, transform, and load. Each will be a simple BashOperator that just prints a message.
dags/my_first_dag.py

```python
from __future__ import annotations

import pendulum

from airflow.models.dag import DAG
from airflow.operators.bash import BashOperator

# Define the DAG
with DAG(
    dag_id="my_first_dag",
    start_date=pendulum.datetime(2025, 1, 1, tz="UTC"),
    schedule="@daily",  # Run this pipeline once per day
    catchup=False,
    tags=["example"],
) as dag:
    # Define the first task (Extract)
    extract_task = BashOperator(
        task_id="extract",
        bash_command="echo 'Extracting data...'",
    )

    # Define the second task (Transform)
    transform_task = BashOperator(
        task_id="transform",
        bash_command="echo 'Transforming data...'",
    )

    # Define the third task (Load)
    load_task = BashOperator(
        task_id="load",
        bash_command="echo 'Loading data...'",
    )

    # Set the task dependencies.
    # extract_task must complete successfully before transform_task can start, and so on.
    extract_task >> transform_task >> load_task
```
When you place this file in your Airflow dags folder, the scheduler will parse it, and the Airflow UI will show a visual representation of your pipeline, ready to be scheduled.