What is an ETL Pipeline?
ETL is a foundational concept in data engineering. It stands for Extract, Transform, and Load, and it describes the process of moving data from one system to another.
- Extract: This is the first step, where you pull raw data from its source. The source could be anything: a production database (like PostgreSQL), a third-party API (like the Stripe API), a set of log files, or CSVs in a storage bucket.
- Transform: Raw data is rarely in the perfect format for analysis. The transform step is where you clean it up. This can involve:
  - Cleaning: Handling missing values or correcting data types.
  - Enriching: Joining the data with another dataset to add more context.
  - Aggregating: Grouping data and calculating sums, averages, or counts.
- Load: In the final step, you load the newly transformed data into its destination. This is often a data warehouse (like BigQuery or Snowflake) or a data lake, where it can be used for analytics, reporting, and machine learning.
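The three steps above can be sketched in plain Python. This is a minimal, hypothetical example: the record data, field names, and function names are made up for illustration, and a real pipeline would read from and write to external systems instead of in-memory lists.

```python
# Minimal ETL sketch. All names and data are hypothetical.

def extract() -> list[dict]:
    # In practice: query a database, call an API, or read files.
    return [
        {"user": "alice", "amount": "10.5"},
        {"user": "bob", "amount": None},  # missing value
        {"user": "alice", "amount": "4.5"},
    ]

def transform(rows: list[dict]) -> dict[str, float]:
    # Clean: drop rows with missing amounts; fix types (str -> float).
    cleaned = [
        {"user": r["user"], "amount": float(r["amount"])}
        for r in rows
        if r["amount"] is not None
    ]
    # Aggregate: total amount per user.
    totals: dict[str, float] = {}
    for r in cleaned:
        totals[r["user"]] = totals.get(r["user"], 0.0) + r["amount"]
    return totals

def load(totals: dict[str, float]) -> None:
    # In practice: write to a warehouse table. Here we just print.
    for user, total in sorted(totals.items()):
        print(f"{user}: {total}")

load(transform(extract()))
```

Even this toy version shows the shape of the work: each step has a clear input and output, which is exactly what makes the steps easy to turn into separate tasks later.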
Why You Need an Orchestrator
You could write a simple Python script to perform an ETL job and run it on a schedule using cron. But what happens when things get complex?
- What if Task B can only run after Task A succeeds? (Dependencies)
- What if a temporary network issue causes a task to fail? (Retries)
- How do you get alerted if the entire pipeline fails? (Monitoring & Alerting)
- How can you see the status of all your pipelines in one place? (Centralized UI)
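To make just one of those concerns concrete, here is the kind of retry logic you would otherwise hand-roll around every task in a plain script. The task function, retry counts, and backoff policy here are all hypothetical; an orchestrator gives you this behavior declaratively, per task.

```python
import time

def run_with_retries(task, retries: int = 3, delay: float = 1.0):
    """Naive retry loop you'd otherwise write around every task."""
    for attempt in range(1, retries + 1):
        try:
            return task()
        except Exception as exc:
            print(f"Attempt {attempt} failed: {exc}")
            if attempt == retries:
                raise  # out of retries: give up and surface the error
            time.sleep(delay * attempt)  # simple linear backoff

# Hypothetical flaky task: fails twice, then succeeds.
calls = {"n": 0}

def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary network issue")
    return "data"

print(run_with_retries(flaky_extract, retries=3, delay=0.01))
```

Now multiply this boilerplate by every task, add dependency checks and alerting, and the appeal of a dedicated tool becomes obvious.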
This is where a workflow orchestrator like Apache Airflow comes in.
Introducing Apache Airflow
Airflow is an open-source platform to programmatically author, schedule, and monitor workflows. You don't drag-and-drop boxes; you define your pipelines in Python code.
Core Concepts
- DAG (Directed Acyclic Graph): The core of Airflow. A DAG is a Python script that defines your pipeline. It's a collection of all the tasks you want to run, organized in a way that shows their relationships and dependencies. It's "Directed" because tasks flow in one direction, and "Acyclic" because it cannot have loops.
- Operator: An Operator defines a single task in a DAG. Airflow ships with many pre-built operators:
  - BashOperator: Executes a bash command.
  - PythonOperator: Calls a Python function.
  - PostgresOperator: Executes a SQL query against a PostgreSQL database.
  - ...and many more, distributed as provider packages, covering most popular services.
- Task: A task is a configured instance of an operator within a DAG; each scheduled run of that task is a separate task instance.
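The "directed acyclic" property is what makes a valid run order computable at all. A minimal sketch of the idea in plain Python, using the standard library's `graphlib` (the task names here are hypothetical):

```python
from graphlib import TopologicalSorter

# A DAG as a dependency map: each task lists the tasks it depends on.
# Task names are hypothetical.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# static_order() yields a valid execution order, and raises
# CycleError if the graph contains a cycle (i.e. is not acyclic).
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'load']
```

This is, in spirit, what a scheduler does with your DAG: walk the dependency graph and only start a task once everything it depends on has succeeded.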
Building a Simple DAG
Let's create a basic DAG with three tasks: extract, transform, and load. Each will be a simple BashOperator that just prints a message.
dags/my_first_dag.py

```python
from __future__ import annotations

import pendulum

from airflow.models.dag import DAG
from airflow.operators.bash import BashOperator

# Define the DAG
with DAG(
    dag_id="my_first_dag",
    start_date=pendulum.datetime(2025, 1, 1, tz="UTC"),
    schedule="@daily",  # Run this pipeline once per day
    catchup=False,
    tags=["example"],
) as dag:
    # Define the first task (Extract)
    extract_task = BashOperator(
        task_id="extract",
        bash_command="echo 'Extracting data...'",
    )

    # Define the second task (Transform)
    transform_task = BashOperator(
        task_id="transform",
        bash_command="echo 'Transforming data...'",
    )

    # Define the third task (Load)
    load_task = BashOperator(
        task_id="load",
        bash_command="echo 'Loading data...'",
    )

    # Set the task dependencies.
    # extract_task must complete successfully before transform_task can start, and so on.
    extract_task >> transform_task >> load_task
```
When you place this file in your Airflow dags folder, the scheduler will parse it, and the Airflow UI will show a visual representation of your pipeline, ready to be scheduled.