As the number of ML projects grew, we hit a problem: how do we reliably orchestrate data flows? Cron jobs were no longer enough, and Apache Airflow became our solution.
## Why Not Cron?
Cron has no dependency management, retry logic, or monitoring. Airflow has all of that — DAGs (workflows as Python code), operators, a scheduler, and a web UI for monitoring and manual triggers.
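To make "workflows as Python code" concrete, here is a minimal sketch using the TaskFlow API (assuming Airflow 2.x; the DAG name, schedule, and task bodies are illustrative, not our production pipeline):

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def example_etl():
    @task
    def extract():
        # In a real pipeline: pull rows from a source system.
        return [1, 2, 3]

    @task
    def load(rows):
        print(f"loaded {len(rows)} rows")

    # Calling one task with another's output declares the dependency edge.
    load(extract())


example_etl()
```

The scheduler picks this file up, builds the dependency graph, and retries failed tasks according to per-task retry settings; none of that exists in cron.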
## Our Kubernetes Setup
Airflow runs on AKS with the KubernetesExecutor, so each task runs in its own pod. Metadata lives in Azure Database for PostgreSQL, logs in Blob Storage. DAGs are versioned in Git and synchronized via a git-sync sidecar.
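The executor and storage choices above are plain configuration. A sketch of the relevant `airflow.cfg` keys for an Airflow 2.x setup (hostnames, credentials, and the log folder name are placeholders, not our real values):

```ini
[core]
executor = KubernetesExecutor

[database]
# Azure Database for PostgreSQL (hostname illustrative)
sql_alchemy_conn = postgresql+psycopg2://airflow:***@example-pg.postgres.database.azure.com:5432/airflow

[logging]
# Ship task logs to Azure Blob Storage via a wasb connection
remote_logging = True
remote_base_log_folder = wasb-airflow-logs
```

With the KubernetesExecutor there are no long-lived workers to size: each task pod is created on demand and torn down when the task finishes.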
## Practical Lessons
- Idempotency — UPSERT instead of INSERT, partitioning by execution date
- Testing DAGs — unit tests for structure validation, integration tests with mock data
- Alerting — Slack + PagerDuty for critical pipelines
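The idempotency lesson is the easiest to demonstrate. A minimal sketch using SQLite (table and column names are made up): the partition key is the execution date, and an UPSERT means a retried or re-run task updates rows instead of duplicating them.

```python
import sqlite3

# In-memory database stands in for the warehouse in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metrics (ds TEXT, name TEXT, value REAL, PRIMARY KEY (ds, name))"
)


def load_partition(ds, rows):
    # UPSERT instead of plain INSERT: reruns overwrite, never duplicate.
    conn.executemany(
        """INSERT INTO metrics (ds, name, value) VALUES (?, ?, ?)
           ON CONFLICT(ds, name) DO UPDATE SET value = excluded.value""",
        [(ds, n, v) for n, v in rows],
    )


load_partition("2024-01-01", [("clicks", 10.0)])
load_partition("2024-01-01", [("clicks", 12.0)])  # retry: row updated in place
count, value = conn.execute(
    "SELECT COUNT(*), MAX(value) FROM metrics WHERE ds = '2024-01-01'"
).fetchone()
print(count, value)  # → 1 12.0
```

Because the write is keyed on `(ds, name)`, clearing and re-running any execution date in Airflow is safe by construction.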
## Airflow = The Backbone of Data Engineering
Flexible and extensible, with a strong community. It requires an upfront investment in setup, but for serious data engineering it's indispensable.