From Raw to Refined: Building ETL Pipelines with Airflow
Building resilient ETL (Extract, Transform, Load) pipelines is the cornerstone of modern data engineering. As data volumes grow and use cases become more sophisticated, orchestrating daily, hourly, or even real-time data processes can get complex. Apache Airflow, a platform created by the community to programmatically author, schedule, and monitor workflows, has quickly become a go-to solution for managing complex data pipelines. In this blog post, we’ll walk you through the fundamentals of ETL pipelines, introduce Airflow’s features, and dive into a step-by-step approach to building everything from basic to advanced pipelines. Whether you are just starting out or looking to expand your professional-level expertise, this guide will help you build robust ETL workflows using Airflow.
Table of Contents
- What is ETL, and Why Does It Matter?
- Introduction to Airflow
- Setting Up Airflow Locally
- Core Airflow Concepts: DAGs, Tasks, and Operators
- Your First Airflow DAG
- Data Extraction in Airflow
- Data Transformation in Airflow
- Loading Data into Target Systems
- Advanced Airflow Features
- Monitoring, Logging, and Alerting
- Performance and Scalability Best Practices
- Real-World Use Cases and Project Ideas
- Conclusion
What is ETL, and Why Does It Matter?
Before diving into Airflow, let’s clarify the concept of ETL:
- Extract (E): This step involves fetching data from one or multiple sources (databases, APIs, CSV files, etc.).
- Transform (T): In this phase, we convert, clean, and shape the data. This might include operations such as deduplication, data standardization, validations, or enrichment from additional sources.
- Load (L): Finally, we move the transformed data to a destination (data warehouse, analytics system, or database) so that it can be used for reporting or analysis.
ETL is essential because it guarantees that raw data is standardized, clean, and ready for analytical or operational use. Without a solid ETL strategy, organizations quickly end up with siloed, messy data that hinders decision-making.
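To make the three phases concrete before we bring Airflow into the picture, here is a minimal, framework-free sketch in plain Python. The file paths and the `email` field are hypothetical placeholders; the rest of this post shows how Airflow orchestrates steps just like these.

```python
import csv

def extract(path="raw_users.csv"):
    # Extract: read raw rows from a source (a local CSV stands in for a database or API)
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: clean and standardize (lowercase emails, drop rows without one)
    return [
        {**row, "email": row["email"].strip().lower()}
        for row in rows
        if row.get("email")
    ]

def load(rows, path="clean_users.csv"):
    # Load: write the cleaned rows to a destination (a file stands in for a warehouse table)
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    load(transform(extract()))
```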
Introduction to Airflow
Apache Airflow started as a solution within Airbnb to manage complex workflows. It is now an open-source project governed by the Apache Software Foundation. Airflow enables you to:
- Programmatically author workflows: You write Python code to define your data pipelines.
- Schedule workflows: You can schedule tasks to run periodically or on specific time intervals.
- Monitor workflows: Airflow provides a handy web-based interface to monitor workflow states, logs, and manage tasks’ execution.
Its core principles include:
- DAG-based scheduling: Airflow organizes tasks into Directed Acyclic Graphs (DAGs). Each DAG represents a complete workflow and defines the execution order of tasks.
- Dependency management: You can specify the exact order in which tasks should be executed and set dependencies for more complex scenarios.
- Scalability and extensibility: Airflow supports distributed execution via Celery or Kubernetes, and you can write custom operators and hooks to interface with nearly any system.
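As an illustration of that extensibility, here is a minimal sketch of a custom operator. The class name and the greeting logic are invented for the example; a real custom operator would typically wrap a hook or an external API in the same way.

```python
from airflow.models.baseoperator import BaseOperator


class GreetingOperator(BaseOperator):
    """A toy custom operator: logs a greeting when the task runs (illustrative only)."""

    def __init__(self, name: str, **kwargs):
        super().__init__(**kwargs)
        self.name = name

    def execute(self, context):
        # execute() is called when the task instance runs;
        # its return value is pushed to XCom automatically.
        message = f"Hello, {self.name}!"
        self.log.info(message)
        return message
```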
Setting Up Airflow Locally
Prerequisites
- Python 3.7 or higher.
- A virtual environment manager like `venv` or `conda` (recommended).
- pip (to install Airflow and its dependencies).
Though Airflow can be deployed in various ways (locally, Docker, Kubernetes), starting locally is often the simplest for beginners.
Installation Steps
1. Create and activate a virtual environment. Example with `venv`:

   ```bash
   python -m venv airflow-env
   source airflow-env/bin/activate
   ```

2. Install Airflow via pip:

   ```bash
   pip install "apache-airflow[postgres,google]>=2.5,<3"
   ```

   The `[postgres,google]` part indicates additional dependencies for using Postgres and Google integrations, but you can tailor it to your environment.

3. Initialize Airflow's metadata database:

   ```bash
   airflow db init
   ```

4. Create an admin user (for the web interface):

   ```bash
   airflow users create \
     --username admin \
     --firstname Airflow \
     --lastname Admin \
     --role Admin \
     --email admin@example.com
   ```

5. Start the Airflow scheduler and webserver:

   ```bash
   airflow scheduler &
   airflow webserver
   ```

   By default, the Airflow UI will be accessible at http://localhost:8080.
Post-Installation Configuration
- airflow.cfg file: You can change default settings like `executor`, `sql_alchemy_conn`, or `web_server_port`.
- Environment variables: Airflow can read environment variables for settings, which helps when deploying to different environments (e.g., dev, staging, prod).
Core Airflow Concepts: DAGs, Tasks, and Operators
Directed Acyclic Graph (DAG)
A DAG is a collection of tasks with dependencies that define how they’re organized and in what order they should run. Each node (task) in the DAG represents work to be done, and edges represent the required sequence.
Tasks
Within a DAG, a task is a parameterized instance of one of Airflow’s classes that tells Airflow what to do (run a Python function, execute a Bash command, move files, etc.).
Operators
Operators are templates for tasks. Airflow comes with many operators out of the box:
- BashOperator: Executes a bash command.
- PythonOperator: Runs a Python function.
- PostgresOperator: Executes SQL statements in a Postgres database.
- S3ToRedshiftOperator: Moves data from S3 to Redshift.
- And many others for AWS, GCP, Spark, etc.
In addition, there are advanced concepts like Hooks (wrappers around external APIs or databases) and Sensors (tasks that wait for a certain condition to be met).
Your First Airflow DAG
Below is a simple DAG that has two tasks: one prints a welcome message, and the other prints the current date. It illustrates basic concepts like scheduling and task dependencies.
```python
import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    'owner': 'data_engineer',
    'start_date': datetime.datetime(2023, 9, 1),
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=2)
}

with DAG(
    dag_id='simple_dag',
    default_args=default_args,
    schedule_interval='@daily',
    catchup=False
) as dag:

    task_hello = BashOperator(
        task_id='hello',
        bash_command='echo "Hello, Airflow!"'
    )

    task_date = BashOperator(
        task_id='show_date',
        bash_command='date'
    )

    task_hello >> task_date
```
Explanation
- default_args: Arguments passed to each task (owner, start date, retries, etc.).
- dag_id: Unique identifier for the DAG.
- schedule_interval: Defines how often the DAG runs (`@daily`, `@hourly`, a cron expression, etc.).
- task_hello >> task_date: The `>>` operator sets the execution order, ensuring `task_hello` completes before `task_date`.
Place this Python file in the `dags` folder (usually located in `~/airflow/dags` by default), and Airflow will automatically detect it. You’ll see it in the Airflow web UI and can trigger or schedule it as needed.
Data Extraction in Airflow
Common Data Sources
- Relational Databases (Postgres, MySQL, Oracle)
- Cloud Storage (AWS S3, GCS)
- SaaS APIs (Salesforce, Stripe)
- Flat Files (CSV, Parquet)
Airflow offers a variety of operators and hooks to connect to these services. Below is a short table that summarizes Airflow operators/hook pairs for some typical sources:
| Data Source | Hook/Operator | Description |
|---|---|---|
| Postgres | PostgresHook, PostgresOperator | Execute SQL statements, fetch data from tables |
| MySQL | MySqlHook, MySqlOperator | Similar usage to Postgres but for MySQL |
| S3 | S3Hook | List, get, and put files in AWS S3 buckets |
| GCS | GCSHook | Convenient methods for interacting with GCS |
| APIs | HttpHook | Make custom REST calls |
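As a quick illustration of the API row above, here is a hedged sketch that pulls JSON from a REST endpoint inside a PythonOperator callable using `HttpHook`. The connection ID `my_api`, the endpoint path, and the output file are assumptions for the example.

```python
import json

from airflow.providers.http.hooks.http import HttpHook


def extract_from_api():
    # HttpHook reads host and auth details from the 'my_api' Airflow connection (assumed)
    hook = HttpHook(method='GET', http_conn_id='my_api')
    response = hook.run(endpoint='/v1/users', headers={"Accept": "application/json"})
    users = response.json()
    # Persist the raw payload for a downstream transformation task
    with open('/tmp/api_users.json', 'w') as f:
        json.dump(users, f)
```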
Example: Extract Data from a Database
Say you have a Postgres database storing user data. You want to export a table to a CSV file in an S3 bucket daily. Below is a simplified outline of how you’d do it with Airflow:
```python
import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def extract_data_from_postgres():
    pg_hook = PostgresHook(postgres_conn_id='my_postgres')
    records = pg_hook.get_pandas_df("SELECT * FROM users")
    records.to_csv('/tmp/users_data.csv', index=False)


def upload_to_s3():
    s3_hook = S3Hook(aws_conn_id='my_s3')
    s3_hook.load_file(
        '/tmp/users_data.csv',
        key='users_data.csv',
        bucket_name='my-bucket',
        replace=True
    )


default_args = {
    'owner': 'data_engineer',
    'start_date': datetime.datetime(2023, 9, 1),
}

with DAG(
    dag_id='extract_postgres_to_s3',
    default_args=default_args,
    schedule_interval='@daily',
    catchup=False,
) as dag:

    extract_task = PythonOperator(
        task_id='extract_postgres',
        python_callable=extract_data_from_postgres
    )

    upload_task = PythonOperator(
        task_id='upload_to_s3',
        python_callable=upload_to_s3
    )

    extract_task >> upload_task
```
Here, we have two Python functions: one fetches data into a CSV, and the other uploads that CSV to S3. Airflow's `PostgresHook` takes care of the connection boilerplate you'd otherwise have to write yourself.
Data Transformation in Airflow
Depending on your use cases, transformations can happen in multiple ways:
- Operator-based transformations: Using built-in or custom Python logic in a PythonOperator.
- External engines: Offloading to Spark, Hadoop, or a cloud service (like AWS EMR or GCP Dataproc).
- SQL-based transformations: Running SQL on a data warehouse (e.g., BigQuery, Snowflake) via an operator.
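For the SQL-based route, a transformation can be as simple as an operator that runs a statement inside the database or warehouse itself. Below is a hedged sketch using `PostgresOperator`; the connection ID and table names are assumptions, and warehouse-specific operators for BigQuery or Snowflake follow the same pattern.

```python
from airflow.providers.postgres.operators.postgres import PostgresOperator

# Build a cleaned table directly in the database instead of pulling rows into Python
clean_users = PostgresOperator(
    task_id='clean_users',
    postgres_conn_id='my_postgres',   # assumed connection ID
    sql="""
        CREATE TABLE IF NOT EXISTS users_clean AS
        SELECT
            id,
            LOWER(TRIM(email)) AS email,
            CAST(sign_up_date AS DATE) AS sign_up_date
        FROM users
        WHERE email IS NOT NULL;
    """,
)
```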
Example Transformation
Let’s illustrate a simple Python transformation for normalizing some data columns:
```python
import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def transform_data():
    import pandas as pd

    df = pd.read_csv('/tmp/users_data.csv')
    # Normalize columns
    df['email'] = df['email'].str.lower()
    # Convert sign-up date from string to datetime
    df['sign_up_date'] = pd.to_datetime(df['sign_up_date'], format='%Y-%m-%d')
    # Save transformed data
    df.to_csv('/tmp/users_data_transformed.csv', index=False)


default_args = {
    'owner': 'data_engineer',
    'start_date': datetime.datetime(2023, 9, 1),
}

with DAG(
    dag_id='transform_data_dag',
    default_args=default_args,
    schedule_interval=None,
    catchup=False,
) as dag:

    transform_task = PythonOperator(
        task_id='transform',
        python_callable=transform_data
    )
```
- This DAG has a single task that reads a CSV, applies transformations, and writes the result to a new CSV.
- In more advanced contexts, you could read from a database or an object store, process large batches with a distributed framework, or chain multiple transformations together in a single DAG.
Loading Data into Target Systems
Common Targets for Load
- Data Warehouse: Redshift, BigQuery, Snowflake.
- Analytical DB: Postgres, MySQL.
- NoSQL Stores: MongoDB, Cassandra.
- Data Lakes: S3 (Parquet or ORC files), HDFS.
Loading data often entails using specialized operators or writing custom scripts to push data to the destination. Below is an example using the BigQuery operator to load data from GCS:
```python
import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryCreateExternalTableOperator

default_args = {
    'owner': 'data_engineer',
    'start_date': datetime.datetime(2023, 9, 1),
}

with DAG(
    dag_id='load_gcs_to_bigquery',
    default_args=default_args,
    schedule_interval='@daily',
    catchup=False,
) as dag:

    create_external_table = BigQueryCreateExternalTableOperator(
        task_id='create_external_table',
        bucket='my-data',
        source_objects=['users_data_transformed.csv'],
        destination_project_dataset_table='my_project.my_dataset.users_data',
        source_format='CSV',
        skip_leading_rows=1,
        field_delimiter=',',
        schema_fields=[
            {'name': 'id', 'type': 'INTEGER', 'mode': 'REQUIRED'},
            {'name': 'name', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'email', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'sign_up_date', 'type': 'TIMESTAMP', 'mode': 'NULLABLE'}
        ],
        google_cloud_storage_conn_id='my_gcp_conn',
        bigquery_conn_id='my_bigquery_conn'
    )
```
This operator creates an external table in BigQuery that reads from your file in GCS. The same concept applies to other warehouses like Snowflake or Redshift, where corresponding operators exist (e.g., `SnowflakeOperator`, `RedshiftSQLOperator`).
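For Redshift specifically, the `S3ToRedshiftOperator` mentioned earlier can issue the COPY for you. Here is a minimal sketch, assuming an existing `analytics.users_data` table and the connection IDs shown (all placeholders for this example):

```python
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

load_to_redshift = S3ToRedshiftOperator(
    task_id='load_users_to_redshift',
    schema='analytics',                      # target schema (placeholder)
    table='users_data',                      # target table (placeholder)
    s3_bucket='my-bucket',
    s3_key='users_data_transformed.csv',
    redshift_conn_id='my_redshift',
    aws_conn_id='my_s3',
    copy_options=['CSV', 'IGNOREHEADER 1'],  # passed through to Redshift's COPY command
)
```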
Advanced Airflow Features
XComs (Cross-Communication)
XCom is Airflow’s built-in mechanism for sharing data between tasks. Rather than persisting data in files, you can push and pull small amounts of data (like IDs, metadata, or simple objects) across tasks.
```python
def push_data(**context):
    context['ti'].xcom_push(key='my_data', value='hello_world')


def pull_data(**context):
    result = context['ti'].xcom_pull(key='my_data', task_ids='push_task')
    print("Received data:", result)
```
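Wiring those two callables into a DAG looks like the sketch below. Note that the `task_ids` value passed to `xcom_pull` must match the pushing task's `task_id` (here `push_task`); returning a value from a callable also pushes it automatically under the key `return_value`.

```python
import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# push_data and pull_data are the functions defined above

with DAG(
    dag_id='xcom_example',
    start_date=datetime.datetime(2023, 9, 1),
    schedule_interval=None,
    catchup=False,
) as dag:

    push_task = PythonOperator(
        task_id='push_task',   # must match the task_ids used in xcom_pull
        python_callable=push_data,
    )

    pull_task = PythonOperator(
        task_id='pull_task',
        python_callable=pull_data,
    )

    push_task >> pull_task
```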
Branching
Branching allows you to conditionally choose a path in your DAG. For instance, you can decide which tasks to run based on whether your data meets some criteria.
```python
from airflow.operators.python import BranchPythonOperator


def choose_branch(**context):
    # Pseudo condition
    if context['execution_date'].weekday() < 5:
        return 'week_day_task'
    else:
        return 'weekend_task'


branch_op = BranchPythonOperator(
    task_id='branch_op',
    python_callable=choose_branch
)
```
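The task ID returned by the callable decides which downstream branch actually runs; the other branch is skipped. A short sketch of the wiring, using `EmptyOperator` as placeholder tasks (`DummyOperator` on older Airflow versions):

```python
from airflow.operators.empty import EmptyOperator  # airflow.operators.dummy.DummyOperator before Airflow 2.3

week_day_task = EmptyOperator(task_id='week_day_task')
weekend_task = EmptyOperator(task_id='weekend_task')

# Only the task whose task_id matches choose_branch()'s return value runs;
# the other branch is marked as skipped.
branch_op >> [week_day_task, weekend_task]
```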
SubDAGs
SubDAGs are a way to nest a DAG inside another DAG for modularity. However, SubDAGs are largely replaced by Task Groups in modern Airflow versions. Task Groups provide a simpler mechanism for grouping tasks visually and logically without introducing a brand new DAG.
Task Groups
```python
from airflow.utils.task_group import TaskGroup

with DAG(...) as dag:
    with TaskGroup("extraction_group") as extraction_group:
        # define tasks for extraction
        ...
    with TaskGroup("transformation_group") as transformation_group:
        # define tasks for transformation
        ...
```
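Task Groups behave like single nodes when you set dependencies, which keeps large DAGs readable. A small self-contained sketch (the DAG ID and task names are illustrative):

```python
import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.task_group import TaskGroup

with DAG(
    dag_id='task_group_example',
    start_date=datetime.datetime(2023, 9, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:

    with TaskGroup("extraction_group") as extraction_group:
        extract_users = BashOperator(task_id='extract_users', bash_command='echo extract users')
        extract_orders = BashOperator(task_id='extract_orders', bash_command='echo extract orders')

    with TaskGroup("transformation_group") as transformation_group:
        transform_all = BashOperator(task_id='transform_all', bash_command='echo transform')

    # The whole extraction group must finish before the transformation group starts
    extraction_group >> transformation_group
```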
Sensors
Sensors are simply specialized operators that wait for a certain event. For example, an `S3KeySensor` waits for a file to land in an S3 bucket before allowing downstream tasks to proceed.
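A hedged sketch of that pattern is below; the bucket, key, and connection ID are placeholders, and on older versions of the Amazon provider the import path is `airflow.providers.amazon.aws.sensors.s3_key` instead.

```python
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

wait_for_file = S3KeySensor(
    task_id='wait_for_users_file',
    bucket_name='my-bucket',
    bucket_key='incoming/users_data.csv',
    aws_conn_id='my_s3',
    poke_interval=60,      # check every 60 seconds
    timeout=60 * 60 * 6,   # give up after 6 hours
    mode='reschedule',     # free the worker slot between checks
)
```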
Monitoring, Logging, and Alerting
Airflow UI
- DAGs View: A list of all the DAGs in your environment.
- Graph View: A graphical representation of tasks and their dependencies.
- Tree View: Historical runs of your DAG over time.
- Task Instance Logs: Logs from each task run.
Email and Slack Notifications
You can configure email or Slack alerts for task failures or retries. Update your `airflow.cfg` or use environment variables to set email or Slack hooks. For Slack, you'd typically use the `SlackWebhookOperator` or custom on-failure callbacks:
```python
from airflow.providers.slack.operators.slack_webhook import SlackWebhookOperator


def slack_fail_alert(context):
    slack_msg = f"DAG {context['dag'].dag_id} failed on task {context['task_instance'].task_id}"
    alert = SlackWebhookOperator(
        task_id='slack_fail',
        http_conn_id='slack_conn',
        webhook_token='YOUR_WEBHOOK_TOKEN',
        message=slack_msg,
        username='airflow'
    )
    return alert.execute(context=context)
```
And then in your task definition:
```python
PythonOperator(
    task_id='some_task',
    python_callable=my_callable,
    on_failure_callback=slack_fail_alert
)
```
Performance and Scalability Best Practices
Executors
Airflow comes with different executors to handle how tasks are run:
- SequentialExecutor: Runs one task at a time (default for local testing).
- LocalExecutor: Runs parallel tasks on a single machine.
- CeleryExecutor: Distributes tasks across multiple worker nodes.
- KubernetesExecutor: Schedules each task in a separate Kubernetes pod.
Database Optimizations
- Ensure you use a production-grade database for the Airflow metadata (e.g., Postgres or MySQL).
- Avoid SQLite in production scenarios as it can cause concurrency issues.
- Clean up old logs and XCom entries regularly to prevent the metadata DB from bloating.
Scaling with Celery or Kubernetes
- CeleryExecutor: Requires a message broker like RabbitMQ or Redis. You set up multiple worker nodes to process tasks in parallel.
- KubernetesExecutor: Each Airflow task runs in its own Kubernetes pod. This is highly dynamic and can scale well if you already have a Kubernetes cluster.
Parallelism and Concurrency
- parallelism: The max number of task instances that can run across all DAGs.
- dag_concurrency: Max number of task instances for a single DAG at once.
- max_active_runs_per_dag: Limits how many DAG runs can be active simultaneously.
Tuning these parameters in `airflow.cfg` helps manage resource usage.
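Besides the global settings, you can also cap concurrency per DAG in the DAG definition itself. A short sketch (the parameter values are illustrative; `max_active_tasks` replaced the older `concurrency` argument in Airflow 2.2):

```python
import datetime

from airflow import DAG

with DAG(
    dag_id='resource_limited_dag',
    start_date=datetime.datetime(2023, 9, 1),
    schedule_interval='@hourly',
    catchup=False,
    max_active_runs=1,      # only one run of this DAG at a time
    max_active_tasks=4,     # at most 4 task instances of this DAG in parallel
) as dag:
    # define tasks here
    ...
```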
Real-World Use Cases and Project Ideas
- Daily Data Warehouse Load: Pull data from an operational Postgres DB, transform it in Python, load it into Redshift or BigQuery.
- Event-Driven Pipelines: Trigger Airflow workflows when a file arrives in S3 or GCS.
- Machine Learning Workflows: Chain data extraction, feature engineering, model training, and model deployment steps.
- Pipeline for IoT Data: Ingest sensor data from MQTT or Kafka, perform real-time transformations, and push to a data lake or time-series DB.
- Reporting and Dashboard Refresh: Automate the generation of summary tables and refresh BI dashboards in Looker, Tableau, or Power BI.
Conclusion
Apache Airflow is a powerful orchestrator that bridges the gap between raw data and refined insights. By defining your ETL (or ELT) workflows as code, you gain clarity, version control, and the ability to scale easily over time. You’ve seen how to set up Airflow locally, build basic DAGs, connect to various data sources, implement transformations, and load the results into different targets. We also covered advanced features, best practices for performance, and real-world scenarios.
As you progress:
- Explore more operators and hooks to reduce custom boilerplate.
- Integrate Airflow with container orchestration platforms like Kubernetes for scalable workloads.
- Investigate advanced concepts like dynamic DAG generation or distributed runtime environments.
Modern data engineering demands reliable pipelines. With Airflow at the heart of your stack, you have a battle-tested, production-grade tool to unify complex processes. As you continue building and refining your ETL workflows, Airflow will be a central pillar that helps turn raw data into insightful, actionable information. Happy orchestrating!