
Data Pipelines with Apache Airflow — Orchestrating Data Flows

18. 01. 2021 · 1 min read

As the number of ML projects grew, we hit a problem: how to reliably orchestrate data flows? Cron jobs were no longer enough. Apache Airflow became our solution.

Why Not Cron?

Cron has no dependency management, retry logic, or monitoring. Airflow has all of that — DAGs (workflows as Python code), operators, a scheduler, and a web UI for monitoring and manual triggers.
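A minimal sketch of what that looks like in practice (the DAG and task names are invented for illustration, and the Airflow 2 API is assumed):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Things cron cannot express: retries with a delay, explicit task
# dependencies, and a schedule the scheduler tracks per run.
default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_feature_build",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    # Dependency management cron lacks: load waits for transform,
    # transform waits for extract.
    extract >> transform >> load
```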

Our Kubernetes Setup

Airflow runs on AKS with the KubernetesExecutor, where each task runs as its own pod. Metadata lives in Azure PostgreSQL, logs in Blob Storage. DAGs are versioned in Git and synchronized via a git-sync sidecar.
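One nice consequence: with the KubernetesExecutor, a task can override its own pod spec right from the DAG file. A hedged sketch using Airflow 2's pod_override mechanism; the DAG name, callable, and resource values are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from kubernetes.client import models as k8s


def train():
    print("training...")  # placeholder for the real workload


with DAG(
    dag_id="ml_training",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # triggered manually
) as dag:
    # Each task becomes its own pod; a heavy task can request more
    # resources without affecting the rest of the DAG.
    PythonOperator(
        task_id="train_model",
        python_callable=train,
        executor_config={
            "pod_override": k8s.V1Pod(
                spec=k8s.V1PodSpec(
                    containers=[
                        k8s.V1Container(
                            name="base",  # must match the task container name
                            resources=k8s.V1ResourceRequirements(
                                requests={"cpu": "2", "memory": "4Gi"},
                            ),
                        )
                    ]
                )
            )
        },
    )
```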

Practical Lessons

  • Idempotency — UPSERT instead of INSERT, partitioning by execution date (see the first sketch after this list)
  • Testing DAGs — unit tests for structure validation, integration tests with mock data (see the second sketch below)
  • Alerting — Slack + PagerDuty for critical pipelines
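The idempotency point in practice: key each run to the logical execution date and upsert, so a retry or backfill rewrites the same partition instead of duplicating rows. A sketch assuming the Postgres provider is installed; the connection ID, table, and columns are invented:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="daily_sales_load",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # {{ ds }} is the logical execution date; re-running the task for
    # the same date overwrites the same partition, producing the same
    # result every time.
    upsert = PostgresOperator(
        task_id="upsert_daily_sales",
        postgres_conn_id="analytics_db",
        sql="""
            INSERT INTO daily_sales (sale_date, store_id, revenue)
            SELECT '{{ ds }}'::date, store_id, SUM(amount)
            FROM raw_sales
            WHERE sale_date = '{{ ds }}'::date
            GROUP BY store_id
            ON CONFLICT (sale_date, store_id)
            DO UPDATE SET revenue = EXCLUDED.revenue;
        """,
    )
```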

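And for DAG testing, a structure-validation test can load the whole DAG folder and fail fast on import errors. A pytest sketch; the dags/ path and the everything-must-retry rule are our own conventions, not Airflow requirements:

```python
import pytest
from airflow.models import DagBag


@pytest.fixture(scope="session")
def dag_bag():
    # Load every DAG in the repo; skip Airflow's bundled examples.
    return DagBag(dag_folder="dags/", include_examples=False)


def test_dags_import_cleanly(dag_bag):
    assert dag_bag.import_errors == {}


def test_every_task_retries(dag_bag):
    for dag_id, dag in dag_bag.dags.items():
        for task in dag.tasks:
            assert task.retries >= 1, f"{dag_id}.{task.task_id} has no retries"
```
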
Airflow = The Backbone of Data Engineering

Flexible, extensible, backed by a strong community. Airflow requires an upfront investment in setup, but for serious data engineering it's indispensable.

Tags: airflow, etl, data pipeline, python, orchestration
