ETL/ELT & Data Pipeline
A pipeline that fails silently is worse than no pipeline at all.
We build data pipelines with a production-first approach — monitoring, alerting, retry logic, data quality checks. Airflow, dbt, Spark — we choose based on your requirements, not the hype.
Why you need robust data pipelines
Data pipelines are the backbone of every data platform. Without reliable data movement and transformation, you have no reporting, no analytics, no AI. Yet most companies treat pipelines as an afterthought — cron scripts, manual exports, unmonitored jobs.
Consequences of poor pipelines
- Silent failures: Pipeline crashes at night, morning dashboard shows yesterday’s data, nobody knows why
- Inconsistent data: Different pipelines transform the same data differently, numbers don’t match
- No repeatability: Manual scripts that only one person understands — the one who left last month
- Zero visibility: You don’t know what’s running, how long, if it finished, if data passed quality checks
Our approach to data pipelines
Apache Airflow — orchestration
Airflow is the de facto standard for data workflow orchestration. The DAG-based approach lets you define dependencies between tasks, retry logic, SLA monitoring, and parallel processing.
What we implement:
- DAG design with clear dependencies and isolated tasks
- TaskFlow API for cleaner Python code
- Custom operators for specific sources (SAP, Salesforce, internal APIs)
- Connection management with Vault/Secret Manager
- SLA alerting — the pipeline must complete within a defined time
- Backfill — reprocessing historical data without duplicates
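Conceptually, what an orchestrator does is run tasks in dependency order with per-task retries. A minimal pure-Python sketch of that idea — this is not Airflow's API, and the task names and the extract → transform → load shape are illustrative:

```python
# Minimal sketch of what an orchestrator like Airflow does under the hood:
# run tasks in dependency order, retrying each one a fixed number of times.
from graphlib import TopologicalSorter

def run_pipeline(tasks, deps, max_retries=3):
    """tasks: name -> callable; deps: name -> set of upstream task names."""
    results = {}
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(1, max_retries + 1):
            try:
                results[name] = tasks[name]()
                break
            except Exception:
                if attempt == max_retries:
                    raise  # exhausted retries -> fail the run (and alert)
    return results

log = []  # records execution order
tasks = {
    "extract": lambda: log.append("extract") or "raw",
    "transform": lambda: log.append("transform") or "clean",
    "load": lambda: log.append("load") or "done",
}
deps = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
results = run_pipeline(tasks, deps)
print(log)  # → ['extract', 'transform', 'load']
```

In a real Airflow DAG the same dependencies would be expressed with the TaskFlow API or `>>` operators, and retries configured per task.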
dbt — transformations as code
dbt (data build tool) brings software engineering best practices to data transformations: a SQL-first approach, Git versioning, automated testing, documentation, and lineage.
Why dbt:
- Versioning: every transformation in Git, code review before deployment
- Testing: schema tests (unique, not_null, accepted_values), custom data tests
- Documentation: auto-generated documentation with a lineage graph
- Incremental processing: processing only new/changed data, not entire tables
- Jinja templating: DRY principles in SQL, parameterized models
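The schema tests mentioned above are declared next to the model in YAML. A hypothetical `schema.yml` fragment — the model and column names are illustrative:

```yaml
# Hypothetical schema.yml for an `orders` model (names are illustrative)
version: 2
models:
  - name: orders
    description: "Cleaned orders from the ERP source"
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ["new", "paid", "shipped", "cancelled"]
```

`dbt test` then runs every declared test and fails the build when data violates them.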
Streaming pipelines
For use cases requiring low latency, we build streaming pipelines:
- Kafka + Kafka Streams: Simple transformations, filtering, real-time enrichment
- Apache Flink: Stateful stream processing with windowing and complex event processing
- Spark Structured Streaming: Micro-batch processing for near-real-time scenarios
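Windowing — grouping an unbounded stream into bounded time buckets — is the core idea shared by Flink, Kafka Streams, and Spark Structured Streaming. A pure-Python sketch of a tumbling window (the event shape and window size are illustrative):

```python
# Tumbling-window aggregation: each event falls into exactly one fixed-size
# time bucket, and we count events per key within each bucket.
from collections import defaultdict

def tumbling_window_counts(events, window_sec):
    """events: iterable of (epoch_ts, key); returns {window_start: {key: count}}."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = ts - (ts % window_sec)  # bucket the timestamp
        windows[window_start][key] += 1
    return {w: dict(counts) for w, counts in windows.items()}

events = [(0, "click"), (5, "click"), (12, "view"), (14, "click")]
result = tumbling_window_counts(events, 10)
print(result)  # → {0: {'click': 2}, 10: {'view': 1, 'click': 1}}
```

Real engines add what this sketch omits: event-time vs processing-time semantics, late-arrival handling via watermarks, and fault-tolerant state.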
Error handling and resilience
Every pipeline has a defined failure strategy:
- Retry with exponential backoff — transient errors (network timeout, rate limit) resolve automatically
- Dead letter queue — records that cannot be processed go to DLQ for manual review
- Circuit breaker — if source system repeatedly fails, pipeline temporarily stops calling it
- Idempotence — repeated pipeline execution doesn’t create duplicates
- Checkpointing — on failure, pipeline continues from last successful point
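Two of the strategies above — retry with exponential backoff and a dead letter queue — can be sketched together in a few lines. The `process` function and record shapes are hypothetical:

```python
# Per-record error handling: retry with exponential backoff for transient
# errors, divert records that keep failing to a dead letter queue (DLQ).
import time

def run_with_dlq(records, process, max_retries=3, base_delay=0.01):
    processed, dlq = [], []
    for record in records:
        for attempt in range(max_retries):
            try:
                processed.append(process(record))
                break
            except Exception as exc:
                if attempt == max_retries - 1:
                    dlq.append((record, str(exc)))  # keep error for manual review
                else:
                    time.sleep(base_delay * (2 ** attempt))  # 10ms, 20ms, ...
    return processed, dlq

def process(record):  # hypothetical transformation
    if record < 0:
        raise ValueError("negative amount")  # permanent error -> ends in DLQ
    return record * 2

processed, dlq = run_with_dlq([1, -5, 3], process)
print(processed)  # → [2, 6]
print(dlq)        # → [(-5, 'negative amount')]
```

Note that one bad record does not fail the whole batch — the pipeline completes and the DLQ is reviewed separately.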
Monitoring and observability
A pipeline without monitoring is a ticking time bomb. We implement:
- Execution metrics: Runtime, processed record count, error rate
- Data quality metrics: Completeness, freshness, volume anomalies
- Alerting: PagerDuty/Slack/Teams notifications on failure or SLA breach
- Grafana dashboards: Overview of all pipelines in one place
- Data lineage: Where data came from, how it was transformed, where it goes
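Two of the data quality metrics above — freshness and volume anomalies — reduce to very simple checks. A sketch with illustrative thresholds:

```python
# Freshness: is the newest record recent enough?
def check_freshness(latest_ts, now_ts, max_lag_sec):
    return (now_ts - latest_ts) <= max_lag_sec

# Naive volume anomaly check: is today's row count within a tolerance band
# around the average of recent runs?
def check_volume(today_count, recent_counts, tolerance=0.5):
    avg = sum(recent_counts) / len(recent_counts)
    return abs(today_count - avg) <= tolerance * avg

print(check_freshness(latest_ts=995, now_ts=1000, max_lag_sec=60))   # → True
print(check_volume(today_count=100, recent_counts=[400, 420, 380]))  # → False
```

A failing check would page on-call via PagerDuty/Slack/Teams rather than let a stale or half-empty dashboard go unnoticed.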
Typical implementations
Batch pipeline for reporting
ERP + CRM + e-shop → Airflow orchestration → dbt transformations in Snowflake → Power BI dashboards. Daily/hourly refresh. Implementation: 4-6 weeks.
CDC pipeline for near-real-time
Debezium CDC from PostgreSQL → Kafka → stream processing → Delta Lake. Latency under 1 minute. Implementation: 3-5 weeks.
Data lake ingestion
Hundreds of sources (APIs, files, databases) → Airbyte/Fivetran → S3/ADLS Bronze layer → dbt Silver/Gold. Scalable to tens of TB daily. Implementation: 6-10 weeks.
Frequently asked questions
What is the difference between ETL and ELT?
ETL transforms data before storage — suitable for regulated environments with strict rules on stored data. ELT stores raw data and transforms it in the target system (Snowflake, BigQuery, Databricks). We usually choose ELT for better flexibility and auditability, but we decide based on context.
How well does Airflow scale?
Dozens to hundreds of DAGs run comfortably in a single Airflow instance. For large organizations, we deploy Airflow on Kubernetes with worker autoscaling. The key is proper dependency design and parallelization.
What happens when a pipeline fails?
Automatic retry with exponential backoff, a dead letter queue for failed records, and an immediate alert to Slack/Teams/PagerDuty. Backfill capability allows reprocessing historical data. Most transient errors resolve automatically.
How do you handle schema changes in source systems?
Schema evolution detection at pipeline input. Breaking changes trigger an alert and stop the pipeline (better no data than bad data). Non-breaking changes (a new column) propagate automatically.
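The breaking/non-breaking distinction can be sketched as a schema diff: a missing or retyped column is breaking, a new column is not. Column names and types below are illustrative:

```python
# Schema evolution detection at pipeline input: compare the expected schema
# against what the source actually delivered.
def diff_schema(expected, incoming):
    """expected/incoming: {column_name: type_name}."""
    missing = set(expected) - set(incoming)
    retyped = {c for c in expected.keys() & incoming.keys()
               if expected[c] != incoming[c]}
    added = set(incoming) - set(expected)
    return {"breaking": missing | retyped, "non_breaking": added}

expected = {"order_id": "int", "amount": "decimal"}
incoming = {"order_id": "int", "amount": "string", "channel": "string"}
result = diff_schema(expected, incoming)
print(result)
# breaking: {'amount'} (type changed) -> alert and stop the pipeline
# non_breaking: {'channel'} -> propagate automatically
```

Any non-empty `breaking` set stops the run before bad data lands downstream.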