
ETL/ELT & Data Pipeline

A pipeline that fails silently is worse than no pipeline at all.

We build data pipelines with a production-first approach — monitoring, alerting, retry logic, and data quality checks. Airflow, dbt, Spark — we choose based on your requirements, not on hype.

  • Pipeline SLA: 99.9%
  • MVP pipeline: 4-6 weeks
  • Recovery time: <15 min
  • Data freshness: near real-time

Why you need robust data pipelines

Data pipelines are the backbone of every data platform. Without reliable data movement and transformation, you have no reporting, no analytics, no AI. Yet most companies treat pipelines as an afterthought — cron scripts, manual exports, unmonitored jobs.

Consequences of poor pipelines

  • Silent failures: Pipeline crashes at night, morning dashboard shows yesterday’s data, nobody knows why
  • Inconsistent data: Different pipelines transform the same data differently, numbers don’t match
  • No repeatability: Manual scripts that only one person understands — the one who left last month
  • Zero visibility: You don’t know what’s running, for how long, whether it finished, or whether the data passed quality checks

Our approach to data pipelines

Apache Airflow — orchestration

Airflow is the de facto standard for data workflow orchestration. Its DAG-based approach enables defining dependencies between tasks, retry logic, SLA monitoring, and parallel processing.

What we implement:

  • DAG design with clear dependencies and isolated tasks
  • TaskFlow API for cleaner Python code
  • Custom operators for specific sources (SAP, Salesforce, internal APIs)
  • Connection management with Vault/Secret Manager
  • SLA alerting — the pipeline must complete within a defined time
  • Backfill — reprocessing historical data without duplicates
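A minimal TaskFlow sketch of such a DAG (assuming Airflow 2.4+): the schedule, paths, and task bodies are illustrative; the point is the declared retry policy, the SLA, and isolated tasks passing data explicitly.

```python
from datetime import datetime, timedelta

from airflow.decorators import dag, task

# One retry policy for every task: exponential backoff between attempts,
# plus an SLA so a slow run triggers an alert instead of failing silently.
default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
    "sla": timedelta(hours=1),
}

@dag(
    schedule="0 5 * * *",            # daily at 05:00
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args=default_args,
)
def daily_sales_pipeline():
    @task
    def extract() -> str:
        # Pull raw data from the source system into staging (illustrative path).
        return "s3://staging/sales/raw"

    @task
    def transform(staging_path: str) -> str:
        # Clean and conform the raw extract; return the curated location.
        return staging_path.replace("raw", "curated")

    @task
    def load(curated_path: str) -> None:
        # Load curated data into the warehouse.
        print(f"loading {curated_path}")

    # Explicit data dependencies: extract -> transform -> load.
    load(transform(extract()))

daily_sales_pipeline()
```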

dbt — transformations as code

dbt (data build tool) brings software engineering best practices to data transformations. SQL-first approach, Git versioning, automated testing, documentation, and lineage.

Why dbt:

  • Versioning: Every transformation in Git, code review before deployment
  • Testing: Schema tests (unique, not_null, accepted_values), custom data tests
  • Documentation: Auto-generated documentation with lineage graph
  • Incremental processing: Processing only new/changed data, not entire tables
  • Jinja templating: DRY principles in SQL, parameterized models
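In orchestrated pipelines we typically trigger dbt from Python. A minimal sketch using dbt's programmatic API (assuming dbt-core 1.5+); the `orders+` selector is illustrative:

```python
from dbt.cli.main import dbtRunner, dbtRunnerResult

def run_dbt(args: list[str]) -> None:
    """Invoke dbt in-process and fail loudly when anything breaks."""
    res: dbtRunnerResult = dbtRunner().invoke(args)
    if not res.success:
        raise RuntimeError(f"dbt failed: {res.exception}")

# `build` runs models in dependency order and executes the schema tests
# (unique, not_null, accepted_values, ...) declared next to each model.
# The selector is illustrative: the `orders` model plus everything downstream.
run_dbt(["build", "--select", "orders+", "--fail-fast"])
```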

Streaming pipelines

For use cases requiring low latency, we build streaming pipelines:

  • Kafka + Kafka Streams: Simple transformations, filtering, real-time enrichment
  • Apache Flink: Complex stream processing with windowing, complex event processing
  • Spark Structured Streaming: Micro-batch processing for near-real-time scenarios (sketched below)
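A minimal PySpark Structured Streaming sketch of the micro-batch variant: read from Kafka, aggregate over one-minute windows, write incrementally. The broker, topic, and schema are illustrative, and the job assumes the spark-sql-kafka connector package is available:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("orders-stream").getOrCreate()

# Expected shape of the JSON events on the topic (illustrative).
schema = StructType([
    StructField("order_id", StringType()),
    StructField("status", StringType()),
    StructField("ts", TimestampType()),
])

orders = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # illustrative broker
    .option("subscribe", "orders")                     # illustrative topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("o"))
    .select("o.*")
)

# Tumbling 1-minute windows; the watermark bounds late data and state size.
counts = (
    orders.withWatermark("ts", "5 minutes")
    .groupBy(window(col("ts"), "1 minute"), col("status"))
    .count()
)

query = (
    counts.writeStream.outputMode("update")
    .format("console")                                  # swap for a Delta/warehouse sink
    .option("checkpointLocation", "/tmp/checkpoints/orders")  # recovery point
    .start()
)
query.awaitTermination()
```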

Error handling and resilience

Every pipeline has a defined failure strategy; retry and dead-letter handling (points 1 and 2) are sketched after the list:

  1. Retry with exponential backoff — transient errors (network timeout, rate limit) resolve automatically
  2. Dead letter queue — records that cannot be processed go to DLQ for manual review
  3. Circuit breaker — if source system repeatedly fails, pipeline temporarily stops calling it
  4. Idempotence — repeated pipeline execution doesn’t create duplicates
  5. Checkpointing — on failure, pipeline continues from last successful point
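A minimal pure-Python sketch of points 1 and 2: exponential backoff with jitter for transient errors, then a dead letter queue for records that keep failing. `process` and `send_to_dlq` are placeholder callables, and `process` must be idempotent for retries to be safe:

```python
import logging
import random
import time
from typing import Any, Callable

log = logging.getLogger("pipeline")

def process_with_retry(
    record: dict[str, Any],
    process: Callable[[dict[str, Any]], None],
    send_to_dlq: Callable[[dict[str, Any], str], None],
    max_attempts: int = 5,
    base_delay: float = 1.0,
) -> None:
    """Retry transient failures with exponential backoff; route the rest to a DLQ."""
    for attempt in range(1, max_attempts + 1):
        try:
            process(record)  # must be idempotent: a re-run must not duplicate data
            return
        except Exception as exc:
            if attempt == max_attempts:
                log.error("giving up on record, routing to DLQ: %s", exc)
                send_to_dlq(record, str(exc))
                return
            # Exponential backoff with jitter to avoid hammering a recovering source.
            delay = base_delay * 2 ** (attempt - 1) * (1 + random.random())
            log.warning("attempt %d failed (%s), retrying in %.1fs", attempt, exc, delay)
            time.sleep(delay)
```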

Monitoring and observability

A pipeline without monitoring is a ticking time bomb. We implement:

  • Execution metrics: Runtime, processed record count, error rate
  • Data quality metrics: Completeness, freshness, volume anomalies (a freshness check is sketched after this list)
  • Alerting: PagerDuty/Slack/Teams notifications on failure or SLA breach
  • Grafana dashboards: Overview of all pipelines in one place
  • Data lineage: Where data came from, how it was transformed, where it goes
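As an example, here is a freshness check that alerts before the business notices stale dashboards. A sketch using an incoming webhook; the webhook URL, table name, and threshold are illustrative:

```python
from datetime import datetime, timedelta, timezone

import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/..."  # illustrative placeholder

def check_freshness(
    table: str,
    last_loaded_at: datetime,
    max_lag: timedelta = timedelta(hours=2),
) -> bool:
    """Alert when a table's newest data is older than the allowed lag."""
    lag = datetime.now(timezone.utc) - last_loaded_at
    if lag <= max_lag:
        return True
    requests.post(
        SLACK_WEBHOOK,
        json={"text": f":warning: {table} is stale: last load {lag} ago "
                      f"(threshold {max_lag})."},
        timeout=10,
    )
    return False
```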

Typical implementations

Batch pipeline for reporting

ERP + CRM + e-commerce platform → Airflow orchestration → dbt transformations in Snowflake → Power BI dashboards. Daily/hourly refresh. Implementation: 4-6 weeks.

CDC pipeline for near-real-time

Debezium CDC from PostgreSQL → Kafka → stream processing → Delta Lake. Latency under 1 minute. Implementation: 3-5 weeks.
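For illustration, the Debezium side of such a pipeline is registered through the Kafka Connect REST API. Hostnames, credentials, and the table list below are placeholders (config keys follow Debezium 2.x naming):

```python
import requests

connector = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "postgres.internal",        # placeholder host
        "database.port": "5432",
        "database.user": "cdc_reader",
        "database.password": "${secrets:cdc/password}",  # resolved by a config provider
        "database.dbname": "shop",
        "topic.prefix": "shop",                          # topics: shop.public.orders, ...
        "table.include.list": "public.orders,public.payments",
    },
}

# Kafka Connect exposes a REST API for connector management (illustrative host).
resp = requests.post("http://kafka-connect:8083/connectors", json=connector, timeout=30)
resp.raise_for_status()
```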

Data lake ingestion

Hundreds of sources (APIs, files, databases) → Airbyte/Fivetran → S3/ADLS Bronze layer → dbt Silver/Gold. Scalable to tens of TB daily. Implementation: 6-10 weeks.

Frequently asked questions

What is the difference between ETL and ELT, and which do you recommend?

ETL transforms data before storage — suitable for regulated environments with strict rules on stored data. ELT stores raw data and transforms it in the target system (Snowflake, BigQuery, Databricks). We usually choose ELT for better flexibility and auditability, but we decide based on context.

How many pipelines can one Airflow instance handle?

Dozens to hundreds of DAGs in a single Airflow instance. For large organizations, we deploy Airflow on Kubernetes with worker autoscaling. The key is proper dependency design and parallelization.

What happens when a pipeline fails?

Automatic retry with exponential backoff, dead letter queue for failed records, immediate alert to Slack/Teams/PagerDuty. Backfill capability for reprocessing historical data. Most transient errors resolve automatically.

How do you handle schema changes in source systems?

Schema evolution detection at the pipeline input. Breaking changes trigger an alert and stop the pipeline (better no data than bad data). Non-breaking changes (a new column) propagate automatically.

Do you have a project?

Let's talk about it.

Schedule a meeting