ETL/ELT & Data Pipeline
A pipeline that fails silently is worse than no pipeline at all.
We build data pipelines with a production-first approach — monitoring, alerting, retry logic, data quality checks. Airflow, dbt, Spark — we choose based on your requirements, not the hype.
Why you need robust data pipelines
Data pipelines are the backbone of every data platform. Without reliable data movement and transformation, you have no reporting, no analytics, no AI. Yet most companies treat pipelines as an afterthought — cron scripts, manual exports, unmonitored jobs.
Consequences of poor pipelines
- Silent failures: Pipeline crashes at night, morning dashboard shows yesterday’s data, nobody knows why
- Inconsistent data: Different pipelines transform the same data differently, numbers don’t match
- No repeatability: Manual scripts that only one person understands — the one who left last month
- Zero visibility: You don’t know what’s running, how long, if it finished, if data passed quality checks
Our approach to data pipelines
Apache Airflow — orchestration
Airflow is the de facto standard for data workflow orchestration. The DAG-based approach lets you define dependencies between tasks, retry logic, SLA monitoring, and parallel processing.
What we implement:
- DAG design with clear dependencies and isolated tasks
- TaskFlow API for cleaner Python code
- Custom operators for specific sources (SAP, Salesforce, internal APIs)
- Connection management with Vault/Secret Manager
- SLA alerting — the pipeline must complete within a defined time
- Backfill — reprocessing historical data without duplicates
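Conceptually, what an orchestrator does is run tasks in dependency order with per-task retries. A minimal pure-Python sketch of that idea — this is not Airflow's API, and the task names and the extract → transform → load shape are illustrative:

```python
# Minimal sketch of what an orchestrator like Airflow does under the hood:
# run tasks in dependency order, retrying each one a fixed number of times.
from graphlib import TopologicalSorter

def run_pipeline(tasks, deps, max_retries=3):
    """tasks: name -> callable; deps: name -> set of upstream task names."""
    results = {}
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(1, max_retries + 1):
            try:
                results[name] = tasks[name]()
                break
            except Exception:
                if attempt == max_retries:
                    raise  # exhausted retries -> fail the run (and alert)
    return results

log = []  # records execution order
tasks = {
    "extract": lambda: log.append("extract") or "raw",
    "transform": lambda: log.append("transform") or "clean",
    "load": lambda: log.append("load") or "done",
}
deps = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
results = run_pipeline(tasks, deps)
print(log)  # → ['extract', 'transform', 'load']
```

In a real Airflow DAG the same dependencies would be expressed with the TaskFlow API or `>>` operators, and retries configured per task.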
dbt — transformations as code
dbt (data build tool) brings software engineering best practices to data transformations: a SQL-first approach, Git versioning, automated testing, documentation, and lineage.
Why dbt:
- Versioning: every transformation in Git, code review before deployment
- Testing: schema tests (unique, not_null, accepted_values), custom data tests
- Documentation: auto-generated documentation with a lineage graph
- Incremental processing: processing only new/changed data, not entire tables
- Jinja templating: DRY principles in SQL, parameterized models
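The schema tests mentioned above are declared next to the model in YAML. A hypothetical `schema.yml` fragment — the model and column names are illustrative:

```yaml
# Hypothetical schema.yml for an `orders` model (names are illustrative)
version: 2
models:
  - name: orders
    description: "Cleaned orders from the ERP source"
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ["new", "paid", "shipped", "cancelled"]
```

`dbt test` then runs every declared test and fails the build when data violates them.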
Streaming pipelines
For use cases requiring low latency, we build streaming pipelines:
- Kafka + Kafka Streams: Simple transformations, filtering, real-time enrichment
- Apache Flink: Stateful stream processing with windowing and complex event processing
- Spark Structured Streaming: Micro-batch processing for near-real-time scenarios
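Windowing — grouping an unbounded stream into bounded time buckets — is the core idea shared by Flink, Kafka Streams, and Spark Structured Streaming. A pure-Python sketch of a tumbling window (the event shape and window size are illustrative):

```python
# Tumbling-window aggregation: each event falls into exactly one fixed-size
# time bucket, and we count events per key within each bucket.
from collections import defaultdict

def tumbling_window_counts(events, window_sec):
    """events: iterable of (epoch_ts, key); returns {window_start: {key: count}}."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = ts - (ts % window_sec)  # bucket the timestamp
        windows[window_start][key] += 1
    return {w: dict(counts) for w, counts in windows.items()}

events = [(0, "click"), (5, "click"), (12, "view"), (14, "click")]
result = tumbling_window_counts(events, 10)
print(result)  # → {0: {'click': 2}, 10: {'view': 1, 'click': 1}}
```

Real engines add what this sketch omits: event-time vs processing-time semantics, late-arrival handling via watermarks, and fault-tolerant state.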
Error handling and resilience
Every pipeline has a defined failure strategy:
- Retry with exponential backoff — transient errors (network timeout, rate limit) resolve automatically
- Dead letter queue — records that cannot be processed go to DLQ for manual review
- Circuit breaker — if source system repeatedly fails, pipeline temporarily stops calling it
- Idempotence — repeated pipeline execution doesn’t create duplicates
- Checkpointing — on failure, pipeline continues from last successful point
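Two of the strategies above — retry with exponential backoff and a dead letter queue — can be sketched together in a few lines. The `process` function and record shapes are hypothetical:

```python
# Per-record error handling: retry with exponential backoff for transient
# errors, divert records that keep failing to a dead letter queue (DLQ).
import time

def run_with_dlq(records, process, max_retries=3, base_delay=0.01):
    processed, dlq = [], []
    for record in records:
        for attempt in range(max_retries):
            try:
                processed.append(process(record))
                break
            except Exception as exc:
                if attempt == max_retries - 1:
                    dlq.append((record, str(exc)))  # keep error for manual review
                else:
                    time.sleep(base_delay * (2 ** attempt))  # 10ms, 20ms, ...
    return processed, dlq

def process(record):  # hypothetical transformation
    if record < 0:
        raise ValueError("negative amount")  # permanent error -> ends in DLQ
    return record * 2

processed, dlq = run_with_dlq([1, -5, 3], process)
print(processed)  # → [2, 6]
print(dlq)        # → [(-5, 'negative amount')]
```

Note that one bad record does not fail the whole batch — the pipeline completes and the DLQ is reviewed separately.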
Monitoring and observability
A pipeline without monitoring is a ticking time bomb. We implement:
- Execution metrics: Runtime, processed record count, error rate
- Data quality metrics: Completeness, freshness, volume anomalies
- Alerting: PagerDuty/Slack/Teams notifications on failure or SLA breach
- Grafana dashboards: Overview of all pipelines in one place
- Data lineage: Where data came from, how it was transformed, where it goes
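Two of the data quality metrics above — freshness and volume anomalies — reduce to very simple checks. A sketch with illustrative thresholds:

```python
# Freshness: is the newest record recent enough?
def check_freshness(latest_ts, now_ts, max_lag_sec):
    return (now_ts - latest_ts) <= max_lag_sec

# Naive volume anomaly check: is today's row count within a tolerance band
# around the average of recent runs?
def check_volume(today_count, recent_counts, tolerance=0.5):
    avg = sum(recent_counts) / len(recent_counts)
    return abs(today_count - avg) <= tolerance * avg

print(check_freshness(latest_ts=995, now_ts=1000, max_lag_sec=60))   # → True
print(check_volume(today_count=100, recent_counts=[400, 420, 380]))  # → False
```

A failing check would page on-call via PagerDuty/Slack/Teams rather than let a stale or half-empty dashboard go unnoticed.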
Typical implementations
Batch pipeline for reporting
ERP + CRM + e-shop → Airflow orchestration → dbt transformations in Snowflake → Power BI dashboards. Daily/hourly refresh. Implementation: 4-6 weeks.
CDC pipeline for near-real-time
Debezium CDC from PostgreSQL → Kafka → stream processing → Delta Lake. Latency under 1 minute. Implementation: 3-5 weeks.
Data lake ingestion
Hundreds of sources (APIs, files, databases) → Airbyte/Fivetran → S3/ADLS Bronze layer → dbt Silver/Gold. Scalable to tens of TB daily. Implementation: 6-10 weeks.
Frequently asked questions
What is the difference between ETL and ELT?
ETL transforms data before storage — suitable for regulated environments with strict rules on stored data. ELT stores raw data and transforms it in the target system (Snowflake, BigQuery, Databricks). We usually choose ELT for better flexibility and auditability, but we decide based on context.
How well does Airflow scale?
Dozens to hundreds of DAGs run comfortably in a single Airflow instance. For large organizations, we deploy Airflow on Kubernetes with worker autoscaling. The key is proper dependency design and parallelization.
What happens when a pipeline fails?
Automatic retry with exponential backoff, a dead letter queue for failed records, and an immediate alert to Slack/Teams/PagerDuty. Backfill capability allows reprocessing historical data. Most transient errors resolve automatically.
How do you handle schema changes in source systems?
Schema evolution detection at pipeline input. Breaking changes trigger an alert and stop the pipeline (better no data than bad data). Non-breaking changes (a new column) propagate automatically.
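The breaking/non-breaking distinction can be sketched as a schema diff: a missing or retyped column is breaking, a new column is not. Column names and types below are illustrative:

```python
# Schema evolution detection at pipeline input: compare the expected schema
# against what the source actually delivered.
def diff_schema(expected, incoming):
    """expected/incoming: {column_name: type_name}."""
    missing = set(expected) - set(incoming)
    retyped = {c for c in expected.keys() & incoming.keys()
               if expected[c] != incoming[c]}
    added = set(incoming) - set(expected)
    return {"breaking": missing | retyped, "non_breaking": added}

expected = {"order_id": "int", "amount": "decimal"}
incoming = {"order_id": "int", "amount": "string", "channel": "string"}
result = diff_schema(expected, incoming)
print(result)
# breaking: {'amount'} (type changed) -> alert and stop the pipeline
# non_breaking: {'channel'} -> propagate automatically
```

Any non-empty `breaking` set stops the run before bad data lands downstream.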