Apache Airflow is the most widely used data-pipeline orchestrator. You define workflows as Python code, schedule their execution, and monitor progress.
What is Apache Airflow¶
Airflow defines a workflow as a DAG (Directed Acyclic Graph): a graph of tasks, created by operators, with explicit dependencies between them.
Core Concepts¶
- DAG — a workflow defined as Python code
- Operator — a single task (BashOperator, PythonOperator, SQL operators)
- Scheduler — triggers DAG runs based on cron expressions or presets
- Executor — where tasks run: Local, Celery, or Kubernetes
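The "directed acyclic" property is what lets the scheduler derive a valid execution order. A minimal, Airflow-independent sketch of that idea using the standard library's `graphlib` (hypothetical task names, not the Airflow API):

```python
from graphlib import TopologicalSorter

# Dependency graph: each task maps to the set of tasks it depends on.
# Mirrors extract >> transform >> load, plus a parallel validate branch.
deps = {
    'transform': {'extract'},
    'validate': {'extract'},
    'load': {'transform', 'validate'},
}

# static_order() yields tasks so every task comes after its dependencies;
# a cycle would raise CycleError, which is why DAGs must be acyclic.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Any valid order starts with `extract` and ends with `load`; `transform` and `validate` could run in parallel, which is exactly what an Airflow executor exploits.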
```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id='daily_sales',
    schedule_interval='0 6 * * *',  # every day at 06:00
    start_date=datetime(2026, 1, 1),
    catchup=False,  # do not backfill runs before deployment
) as dag:
    # extract_fn, transform_fn, load_fn are assumed to be defined elsewhere
    extract = PythonOperator(task_id='extract', python_callable=extract_fn)
    transform = PythonOperator(task_id='transform', python_callable=transform_fn)
    load = PythonOperator(task_id='load', python_callable=load_fn)

    extract >> transform >> load
```
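The `>>` chaining works because Airflow operators overload `__rshift__`: `a >> b` registers `b` as downstream of `a` and returns `b`, so chains compose left to right. A toy reimplementation (hypothetical `Task` class, not Airflow internals) shows the mechanics:

```python
class Task:
    """Toy stand-in for an Airflow operator, just to illustrate >>."""
    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # a >> b : b runs after a; returning `other` lets a >> b >> c chain
        self.downstream.append(other)
        return other

extract, transform, load = Task('extract'), Task('transform'), Task('load')
extract >> transform >> load

print([t.task_id for t in extract.downstream])    # ['transform']
print([t.task_id for t in transform.downstream])  # ['load']
```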
TaskFlow API (Airflow 2.x)¶
```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule_interval='@daily', start_date=datetime(2026, 1, 1))
def sales_pipeline():
    # fetch_data, clean, save are assumed to be defined elsewhere
    @task()
    def extract():
        return fetch_data()

    @task()
    def transform(data):
        return clean(data)

    @task()
    def load(data):
        save(data)

    # Passing return values wires the dependencies and the XCom transfer
    load(transform(extract()))

sales_pipeline()
```
Best Practices¶
- Idempotence — re-running a task produces the same result, with no duplicates
- Atomicity — a task either succeeds completely or fails completely
- XCom for metadata only — pass references (paths, IDs), never large datasets
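In a load task, idempotence typically means "delete the partition, then insert it" inside one transaction, so a retry or a backfill re-run never duplicates rows. A sketch of that pattern with SQLite (hypothetical `sales` table and data, outside Airflow):

```python
import sqlite3

def load_sales(conn, day, rows):
    """Idempotent, atomic load: wipe the day's partition, then insert it."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("DELETE FROM sales WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO sales (day, amount) VALUES (?, ?)",
            [(day, amt) for amt in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, amount REAL)")

# Running the same load twice leaves exactly one copy of the data.
load_sales(conn, "2026-01-01", [10.0, 20.0])
load_sales(conn, "2026-01-01", [10.0, 20.0])
count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(count)  # 2
```

An append-only `INSERT` would leave 4 rows after the second run; the delete-then-insert pair is what makes retries safe.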
Summary¶
Airflow is the de facto standard for workflow orchestration. The TaskFlow API simplifies DAG code; in practice the keys are idempotent tasks and proper credential management.