Mastering Airflow: A Step-by-Step Guide to Automated ETL
Airflow has become the de facto standard for orchestrating and automating data pipelines in modern data engineering. Built by Airbnb in 2014, Airflow empowers data engineers with a platform to programmatically author, schedule, and monitor workflows. If you need a resilient and scalable method to run Extract, Transform, and Load (ETL) jobs, Airflow is a formidable choice.
In this guide, we’ll walk through the fundamental concepts of Airflow—covering installation, basic DAG creation, and usage of operators—before diving into more advanced topics such as sensors, subDAGs, best practices, and production-level deployments. Read on to learn how to build robust data pipelines with Airflow, step by step.
Table of Contents
- What Is Airflow?
- Key Concepts and Terminology
- Getting Started
- Building Your First Airflow DAG
- Working with Operators
- Advanced Airflow Concepts
- Managing Dependencies and Scheduling
- Monitoring and Debugging
- Scaling Airflow
- Performance Optimization and Best Practices
- Production-Ready Deployment
- Conclusion
- Further Reading
What Is Airflow?
Apache Airflow is an open-source platform to programmatically author, schedule, and monitor workflows. Originally developed by Airbnb, it aims to provide:
- A clean, configurable, and flexible way to define tasks and their dependencies.
- A scheduler that executes tasks on an array of workers while respecting defined dependencies.
- A user-friendly web interface for managing, monitoring, and troubleshooting tasks.
Airflow follows a workflows-as-code approach, meaning you define your data pipelines (commonly referred to as “workflows” or “ETL pipelines”) in Python scripts. This approach allows you to write complex business logic without the constraints of a purely GUI-based system.
Key Concepts and Terminology
Directed Acyclic Graphs (DAGs)
In Airflow, a Directed Acyclic Graph (DAG) is a collection of tasks organized to reflect their relationships and dependencies. The “acyclic” part means the graph cannot have loops—no task can depend on itself, directly or indirectly.
- Nodes in the DAG represent tasks, and
- Edges indicate the sequence in which those tasks must be executed.
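To make this concrete, here is a minimal sketch (DAG and task names are illustrative; EmptyOperator requires Airflow 2.3+, older releases use DummyOperator) in which the >> operator draws the edges between three tasks:

from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator  # Airflow 2.3+; older releases use DummyOperator

# Hypothetical three-step pipeline: each >> defines an edge in the DAG
with DAG(dag_id='edges_example', start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    extract = EmptyOperator(task_id='extract')
    transform = EmptyOperator(task_id='transform')
    load = EmptyOperator(task_id='load')

    # extract must finish before transform, which must finish before load
    extract >> transform >> load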
Tasks and Operators
A task is a self-contained unit of work, and an operator is a template defining what that task does. Operators come in various types:
- BashOperator: Executes a bash command.
- PythonOperator: Handles Python callable functions.
- EmailOperator: Sends email notifications.
- … (there are many more depending on your use cases and providers)
You can see operators as building blocks for your tasks. Each operator can be parameterized and orchestrated to fit your data pipeline.
Scheduling and Execution
Airflow runs your DAGs on a scheduled basis (unless you trigger them manually). The scheduler looks at the start date, interval (or cron expression), and any time-based parameters to decide when your DAG should run. If the dependencies are met, tasks are placed in the queue, and an executor manages how tasks run in parallel or sequentially based on resources.
Getting Started
Installation and Setup
It’s easy to get started with Airflow using pip:
pip install apache-airflow
However, you must be aware of some constraints and best practices:
- Using a dedicated Python virtual environment is strongly advised.
- Make sure you have a proper version of Python (3.7+ for recent Airflow releases).
- Be mindful of the Airflow version you install, as different versions have different dependencies.
Once installed, you’ll want to set the AIRFLOW_HOME environment variable (by default, Airflow uses ~/airflow as the home directory), initialize the metadata database, and create an admin user:
export AIRFLOW_HOME=~/my_airflow
airflow db init
airflow users create \
    --username admin \
    --password admin \
    --firstname Admin \
    --lastname User \
    --role Admin \
    --email admin@example.com
Finally, start the scheduler and web server:
airflow webserver -p 8080
airflow scheduler
You can now access the Airflow UI at http://localhost:8080.
Basic Configuration Files
Airflow has a few files you should be aware of:
- airflow.cfg: The main Airflow configuration file that controls the executor type, database connections, and more.
- webserver_config.py: Contains the configuration for Airflow’s web interface.
- requirements.txt (or environment management files): For specifying additional Python packages your DAGs depend on.
You typically store your DAGs in the dags folder within your AIRFLOW_HOME.
Building Your First Airflow DAG
DAG Definition
At its simplest, a DAG is defined by creating a Python file (e.g., simple_dag.py) within your dags folder that contains a DAG object. The timing configuration is defined in the DAG initialization, along with the start date and schedule interval.
Let’s break it down with a quick example.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

def say_hello():
    print("Hello from Airflow!")

with DAG(
    dag_id='simple_dag',
    default_args=default_args,
    description='A simple tutorial DAG',
    start_date=datetime(2023, 1, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:

    task_hello = PythonOperator(
        task_id='task_hello',
        python_callable=say_hello,
    )

    task_hello
Example: Simple DAG with PythonOperator
In the snippet above:
- We import the necessary modules (DAG, PythonOperator).
- We define a dictionary of default_args to handle aspects like retries and email notifications.
- We define a function say_hello() that prints a message.
- We create a DAG context with the with statement, specifying parameters like start_date and schedule_interval.
- We create a PythonOperator that references our say_hello() function.
Once the file is placed in your dags directory, Airflow automatically detects and parses it. You should see simple_dag in the Airflow UI under the list of DAGs.
Working with Operators
Common Operators
Airflow offers a variety of operators that can help with typical tasks. Here’s a brief table listing some commonly used operators:
| Operator | Purpose |
|---|---|
| BashOperator | Execute bash commands |
| PythonOperator | Execute Python callables |
| EmailOperator | Send emails |
| SQL operators | Execute SQL commands (e.g., PostgresOperator) |
| DockerOperator | Run Docker containers |
| KubernetesPodOperator | Run tasks in Kubernetes Pods |
Each operator can be fine-tuned based on parameters, like environment variables for BashOperator, or query strings for SQL-related operators.
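As an illustrative (not prescriptive) example, here is a minimal BashOperator parameterized with environment variables; the variable names and values are hypothetical, and a dag object is assumed to be in scope as in the earlier example:

from airflow.operators.bash import BashOperator

extract_task = BashOperator(
    task_id='extract_data',
    bash_command='echo "Extracting $SOURCE_TABLE into $TARGET_DIR"',
    env={'SOURCE_TABLE': 'orders', 'TARGET_DIR': '/tmp/exports'},  # hypothetical values
    dag=dag,
)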
Branching
Branching is a powerful feature in Airflow that allows you to control the flow of tasks. You can use an operator like BranchPythonOperator to decide which path your DAG should take based on some condition:
from airflow.operators.python import BranchPythonOperator
def _branching_logic():
    # Return the task_id of the next task to run
    if some_condition():
        return 'task_a'
    else:
        return 'task_b'
branching = BranchPythonOperator(
    task_id='branch_task',
    python_callable=_branching_logic,
    dag=dag,
)

task_a = PythonOperator(
    task_id='task_a',
    python_callable=some_callable_a,
    dag=dag,
)

task_b = PythonOperator(
    task_id='task_b',
    python_callable=some_callable_b,
    dag=dag,
)
branching >> [task_a, task_b]
Templating
Airflow’s templating feature uses Jinja to dynamically replace variables in your tasks at runtime. This is often helpful for passing run dates ({{ ds }} or {{ execution_date }}) into tasks. For example:
templated_command = """ echo "{{ ds }}" echo "{{ macros.ds_add(ds, 7) }}""""
task = BashOperator(
    task_id='templated_task',
    bash_command=templated_command,
    dag=dag,
)
The above snippet would print the current run date and a date 7 days in the future.
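Templating also works for operator arguments declared as template fields. A minimal sketch (the callable and task names are illustrative) passing the run date into a PythonOperator via op_kwargs:

def report_date(run_date):
    print(f"Processing data for {run_date}")

report_task = PythonOperator(
    task_id='report_task',
    python_callable=report_date,
    op_kwargs={'run_date': '{{ ds }}'},  # rendered by Jinja at runtime
    dag=dag,
)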
Advanced Airflow Concepts
Sensors
Sensors are a special type of operator that “senses” for some condition to be met before running downstream tasks. Examples include:
- FileSensor: Waits for a file to be present in a filesystem.
- ExternalTaskSensor: Waits for another DAG’s task to complete.
- S3KeySensor: Waits for a file to show up in an S3 bucket.
from airflow.sensors.filesystem import FileSensor
wait_for_file = FileSensor(
    task_id='wait_for_file',
    filepath='/tmp/data_ready.txt',
    poke_interval=30,
    timeout=600,
    dag=dag,
)
The above sensor checks every 30 seconds if /tmp/data_ready.txt exists. If it isn’t found within 600 seconds, the task fails.
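If you run many long-waiting sensors, one common adjustment is reschedule mode, which frees the worker slot between pokes. A hedged variation of the same sensor (the intervals are illustrative):

wait_for_file = FileSensor(
    task_id='wait_for_file',
    filepath='/tmp/data_ready.txt',
    poke_interval=300,       # check every 5 minutes
    timeout=6 * 60 * 60,     # give up after 6 hours
    mode='reschedule',       # release the worker slot between checks
    dag=dag,
)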
SubDAGs and Modular Pipelines
A SubDAG is a DAG embedded in a parent DAG, meant to provide modularity and reusability. SubDAGs help manage complex workflows by breaking them into smaller, more maintainable pieces. However, subDAGs can complicate scheduling unless designed carefully, and they are deprecated in Airflow 2.
The recommended alternative in modern Airflow is TaskGroup, which provides a more streamlined way of grouping tasks under a semantic name without the limitations of subDAGs.
Custom Operators and Plugins
If the built-in operators don’t meet your needs, you can create custom operators by extending Airflow’s BaseOperator. You can also develop plugins to package and distribute your custom functionality:
from airflow.models import BaseOperator
class MyCustomOperator(BaseOperator):
    def __init__(self, my_param, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.my_param = my_param

    def execute(self, context):
        # Custom logic here
        self.log.info(f'My param is: {self.my_param}')
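As a usage sketch (the task ID and parameter value are illustrative, and a dag object is assumed to be in scope), the custom operator is instantiated like any built-in one:

custom_task = MyCustomOperator(
    task_id='custom_task',
    my_param='hello',   # hypothetical value passed to the operator
    dag=dag,
)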
Plugins organize custom hooks, operators, sensors, and more into a cohesive package that can be easily shared.
TaskGroup in Airflow 2
Introduced in Airflow 2, TaskGroup allows you to group tasks logically without the overhead of subDAGs. This is done by creating a grouping context:
from airflow.utils.task_group import TaskGroup
with TaskGroup("data_processing") as data_processing: task_1 = PythonOperator( task_id='task_1', python_callable=lambda: print("Task 1") ) task_2 = PythonOperator( task_id='task_2', python_callable=lambda: print("Task 2") )
task_1 >> task_2
You can then reference data_processing in the parent DAG to place it in the workflow:
start_task >> data_processing >> end_task
Managing Dependencies and Scheduling
Triggers and External Task Dependencies
Airflow allows DAGs to trigger other DAGs. You can configure triggers using operators like TriggerDagRunOperator. You might also use ExternalTaskSensor to wait for a particular task in an external DAG to complete before continuing.
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
trigger_dag = TriggerDagRunOperator(
    task_id='trigger_other_dag',
    trigger_dag_id='other_dag_id',
    execution_date='{{ ds }}',
    wait_for_completion=True,
    poke_interval=30,
    dag=dag,
)
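For the reverse direction, here is a minimal ExternalTaskSensor sketch (the DAG and task IDs are illustrative; by default the sensor waits for the run of the external DAG that shares the same logical date):

from airflow.sensors.external_task import ExternalTaskSensor

wait_for_upstream = ExternalTaskSensor(
    task_id='wait_for_upstream',
    external_dag_id='other_dag_id',
    external_task_id='final_task',   # hypothetical task in the external DAG
    timeout=600,
    mode='reschedule',
    dag=dag,
)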
Cron Expressions and Timed Schedules
A big part of Airflow is deciding how often your DAG runs. You can use built-in Airflow strings like @daily, @hourly, or @weekly, but you can also provide a cron expression to gain full control:
# Run every day at 2 AM
schedule_interval='0 2 * * *'
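For reference, the five cron fields are minute, hour, day of month, month, and day of week. A short sketch (hypothetical DAG name) wiring the expression above into a DAG definition:

from datetime import datetime
from airflow import DAG

# '0 2 * * *'   -> every day at 02:00
# '30 6 * * 1'  -> every Monday at 06:30
# '0 */4 * * *' -> every 4 hours, on the hour
with DAG(
    dag_id='nightly_etl',              # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval='0 2 * * *',
    catchup=False,
) as dag:
    ...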
Monitoring and Debugging
Airflow UI
The Airflow UI provides a graphical representation of your DAGs, their runs, and task status. You can:
- See a DAG’s structure in the Graph view.
- View a run’s status in the Tree view.
- Check logs for each task.
- Trigger manual runs.
- Pause DAGs.
Logging and Alerts
Airflow logs every task’s output by default, storing logs either on the local file system or in services like S3, depending on your configuration. You can also configure email alerts, Slack notifications, or other custom alerting methods on task failure or retries.
default_args = {
    'email_on_failure': True,
    'email': ['airflow@example.com'],
}
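Note that email alerts also require SMTP settings in airflow.cfg. For other channels, you can attach a failure callback; a minimal sketch, assuming you plug in your own notification call where the print statement is:

def notify_on_failure(context):
    # Airflow passes the task context to the callback
    ti = context['task_instance']
    message = f"Task {ti.task_id} in DAG {ti.dag_id} failed."
    # e.g., post `message` to Slack, PagerDuty, or another alerting service
    print(message)

default_args = {
    'on_failure_callback': notify_on_failure,
}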
Scaling Airflow
Executor Choices
Airflow can be scaled vertically (more CPU, more memory on a single server) or horizontally (multiple workers). The choice of executor is crucial:
- SequentialExecutor: Processes tasks sequentially (good for testing or small setups).
- LocalExecutor: Runs tasks in parallel locally, using multiple processes.
- CeleryExecutor: Distributes tasks across multiple worker nodes.
- KubernetesExecutor: Creates dynamic worker pods in Kubernetes for tasks.
Celery Executor
Celery is a distributed task queue that allows you to scale out worker processes horizontally. With the CeleryExecutor, you configure a message broker (often Redis or RabbitMQ) and result backend, enabling you to add or remove worker servers seamlessly.
Here’s a sample excerpt from airflow.cfg:
[core]
executor = CeleryExecutor

[celery]
broker_url = redis://localhost:6379/0
result_backend = db+postgresql://airflow:airflow@localhost:5432/airflow
Once configured, you can start workers using:
airflow celery worker
Kubernetes Executor
The KubernetesExecutor spins up a new pod for each task, providing strong isolation and per-task scaling. You’ll need a Kubernetes cluster (e.g., self-managed or a cloud service like EKS, GKE, or AKS). In your airflow.cfg:
[core]
executor = KubernetesExecutor

[kubernetes]
namespace = airflow
Then, each time a task is scheduled, it launches a new pod in the airflow namespace, pulling any necessary images.
Performance Optimization and Best Practices
Designing Efficient DAGs
- Keep DAGs small and focused: Each DAG should perform a coherent set of tasks.
- Use cross-DAG dependencies wisely: Review your design to avoid overly complex interlocking DAGs.
- Minimize overhead: Avoid running heavy computations directly in tasks. Instead, delegate such work to external systems or specialized services.
Avoiding Common Pitfalls
- Too many tasks in one DAG: Harder to manage, can slow the scheduler.
- Excessive sensor usage: Sensors that poke frequently can overload the scheduler; consider asynchronous sensors.
- Poorly tuned concurrency settings: If too low, tasks starve. If too high, you risk resource exhaustion. See the sketch after this list for the relevant DAG-level parameters.
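As a sketch of the DAG-level knobs involved (parameter names from Airflow 2.2+, where max_active_tasks replaced the older concurrency argument; imports as in the first example):

with DAG(
    dag_id='tuned_dag',        # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval='@daily',
    catchup=False,
    max_active_runs=1,         # only one DAG run at a time
    max_active_tasks=8,        # cap parallel task instances within this DAG
) as dag:
    ...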
Testing Strategies
Airflow offers several methods for testing:
- Local debug: Run tasks in isolation with airflow tasks test <DAG_ID> <TASK_ID> <EXECUTION_DATE>.
- Unit testing: Import your DAGs in a test file and assert certain properties (e.g., number of tasks), as sketched after this list.
- CI/CD: Airflow structures can be integrated into a CI/CD pipeline, checking DAG syntax automatically.
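A minimal unit-test sketch using Airflow’s DagBag (pytest style; the folder path and expected task count assume the simple_dag example above):

from airflow.models import DagBag

def test_simple_dag_loads():
    dag_bag = DagBag(dag_folder='dags/', include_examples=False)
    assert dag_bag.import_errors == {}      # no DAG files failed to parse
    dag = dag_bag.get_dag('simple_dag')
    assert dag is not None
    assert len(dag.tasks) == 1              # simple_dag defines a single task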
Production-Ready Deployment
Docker and Docker Compose
An easy path to production-like environments is Docker. By containerizing Airflow components (webserver, scheduler, workers), you ensure parity across environments. A simple docker-compose.yaml might look like:
version: '3'
services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
  redis:
    image: redis:latest
  airflow-webserver:
    image: apache/airflow:2.3.0
    depends_on:
      - postgres
      - redis
    environment:
      AIRFLOW__CORE__EXECUTOR: CeleryExecutor
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
      AIRFLOW__CELERY__BROKER_URL: redis://redis:6379/0
      AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres:5432/airflow
    volumes:
      - ./dags:/opt/airflow/dags
    ports:
      - "8080:8080"
    command: ["webserver"]
  airflow-scheduler:
    image: apache/airflow:2.3.0
    depends_on:
      - airflow-webserver
    environment:
      AIRFLOW__CORE__EXECUTOR: CeleryExecutor
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
      AIRFLOW__CELERY__BROKER_URL: redis://redis:6379/0
      AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres:5432/airflow
    volumes:
      - ./dags:/opt/airflow/dags
    command: ["scheduler"]
  airflow-worker:
    image: apache/airflow:2.3.0
    depends_on:
      - airflow-webserver
    environment:
      AIRFLOW__CORE__EXECUTOR: CeleryExecutor
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
      AIRFLOW__CELERY__BROKER_URL: redis://redis:6379/0
      AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres:5432/airflow
    volumes:
      - ./dags:/opt/airflow/dags
    command: ["celery", "worker"]
Run docker-compose up and you have a multi-container Airflow environment, complete with a Postgres database and Redis broker.
Airflow in the Cloud
If your team already uses a cloud provider, you can leverage:
- Amazon Managed Workflows for Apache Airflow (MWAA) on AWS.
- Google Cloud Composer.
- Astronomer (a managed Airflow platform).
Managed solutions simplify cluster scaling, upgrades, and resource provisioning, letting you focus on DAG creation instead of infrastructure management.
Conclusion
Airflow provides a flexible, maintainable, and scalable way to orchestrate ETL pipelines. By defining DAGs in Python, you gain the power to build complex data workflows driven by scheduling, sensors, and branching logic. You can easily integrate external systems via operators and sensors, or extend its functionality with custom operators and plugins.
As you progress:
- Start small with local, simple DAGs.
- Explore advanced scheduling, branching, and templating.
- Scale out using CeleryExecutor or KubernetesExecutor.
- Adopt best practices to avoid common pitfalls like sensor overload or extremely large DAGs.
- Consider Docker or managed cloud solutions to reduce operational overhead.
With a bit of experimentation and a vision for your data pipelines, Airflow can become the backbone of your organization’s automated ETL, ensuring data reliability, consistency, and traceability.
Further Reading
- Airflow official documentation: https://airflow.apache.org/docs/apache-airflow/stable/
- Airflow GitHub Repository: https://github.com/apache/airflow
- Astronomer Blog for Airflow tips and tutorials: https://www.astronomer.io/blog
- Cron Expressions reference: https://crontab.guru
Feel free to dive deeper into advanced features such as SLA monitoring, task lineage, and performance profiling. Mastering Airflow is a journey that leads to building highly reliable and fault-tolerant data pipelines—enjoy the ride!