If I had a crown for every ETL pipeline built on stored procedures in Oracle… The modern data engineering stack has changed dramatically in recent years. dbt, Airflow, Snowflake — tools that are transforming how we work with data. Here’s our stack and why we chose it.
The Old World: ETL in Oracle
Typical Czech enterprise: Oracle DB, stored procedures for transformations, scheduling via Oracle Scheduler or cron. Does it work? Yes. Is it maintainable? Barely. Version control? None. Testing? Non-existent. Documentation? Comments in PL/SQL (maybe).
ELT Instead of ETL
Paradigm shift: instead of transforming before loading (ETL), load raw data and transform in the target system (ELT). Why? Because modern warehouses (Snowflake, BigQuery) have enough compute for transformations. And raw data in a landing zone gives you the ability to reprocess when logic changes.
Our Stack
Ingestion: Airbyte for standard sources (databases, APIs, SaaS), custom Python scripts for specific sources. Data landing zone in Azure Blob Storage.
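To give a feel for the custom side, here is a minimal sketch of an ingestion script that drops raw JSON into the landing zone, using the azure-storage-blob package. The container name, environment variable, and the sample record are illustrative assumptions, not our production code:

# ingest_custom_source.py -- hypothetical sketch of a custom ingestion script
import json
import os
from datetime import datetime, timezone

from azure.storage.blob import BlobServiceClient  # pip install azure-storage-blob

def load_to_landing_zone(records: list[dict], source: str) -> None:
    # Connection string from an env var; the "landing-zone" container is an assumption.
    client = BlobServiceClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"]
    )
    # Partition raw data by source and load date so reprocessing stays cheap.
    blob_path = f"{source}/{datetime.now(timezone.utc):%Y/%m/%d}/records.json"
    blob = client.get_blob_client(container="landing-zone", blob=blob_path)
    # Raw, untransformed JSON: transformation happens later in Snowflake (ELT).
    blob.upload_blob(json.dumps(records), overwrite=True)

if __name__ == "__main__":
    # A literal record stands in for whatever the specific source API returns.
    load_to_landing_zone([{"id": 1, "amount": 100}], source="demo_api")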
Warehouse: Snowflake. Yes, it’s expensive. But: separation of storage and compute, auto-scaling, time travel, zero-copy cloning for dev/test environments. For enterprise with many data consumers, it’s a game changer.
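Zero-copy cloning in particular deserves a sketch: spinning up a full dev environment is a single metadata-only statement. Here it is driven from Python via the snowflake-connector-python package; the account variables and database names are placeholders:

# clone_dev_env.py -- sketch: create a dev database as a zero-copy clone of prod
import os

import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)
# CLONE copies metadata only: no storage is duplicated until the data diverges,
# which is what makes throwaway dev/test environments so cheap.
conn.cursor().execute("CREATE OR REPLACE DATABASE ANALYTICS_DEV CLONE ANALYTICS")
conn.close()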
Transformation: dbt (data build tool). SQL transformations versioned in Git, data testing (not null, unique, referential integrity), documentation generated from code. For SQL-native analysts, it’s a revolution.
Orchestration: Apache Airflow. DAGs in Python, rich operator ecosystem, UI for monitoring. On Kubernetes (KubernetesExecutor) for scaling.
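A minimal sketch of the daily DAG shape, assuming Airflow 2.4+ and a dbt project available on the workers; the schedule, project path, and the placeholder ingestion command are illustrative:

# dags/daily_elt.py -- hypothetical daily ELT DAG: ingest, then dbt run + test
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_elt",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Trigger Airbyte syncs / custom ingestion; a placeholder command here.
    ingest = BashOperator(task_id="ingest", bash_command="echo 'trigger ingestion'")
    # Build the dbt models, then run the data tests; a test failure fails the DAG.
    dbt_run = BashOperator(task_id="dbt_run", bash_command="dbt run --project-dir /opt/dbt")
    dbt_test = BashOperator(task_id="dbt_test", bash_command="dbt test --project-dir /opt/dbt")

    ingest >> dbt_run >> dbt_test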
BI: Metabase for self-service analytics (simple, free tier), Looker for power users. Metabase on top of Snowflake works surprisingly well for 90% of analytical queries.
dbt — Why It’s a Game Changer
dbt turns SQL transformations into a software engineering project: versions in Git, pull requests, code review, CI/CD, tests, documentation. The analyst writes the SELECT; dbt handles the materialization (view, table, incremental). For example:
-- models/marts/finance/monthly_revenue.sql
SELECT
    date_trunc('month', order_date) AS month,
    SUM(amount)                     AS revenue,
    COUNT(DISTINCT customer_id)     AS customers
FROM {{ ref('stg_orders') }}
WHERE status = 'completed'
GROUP BY 1
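And because it is all just code, CI can build and test the model on every pull request. A sketch using the programmatic dbtRunner API (available in dbt-core 1.5+); the model selector is illustrative:

# ci_dbt.py -- hypothetical CI step: build and test the finance mart
from dbt.cli.main import dbtRunner

runner = dbtRunner()
# Build the model plus everything upstream of it, then run its data tests.
build = runner.invoke(["run", "--select", "+monthly_revenue"])
tests = runner.invoke(["test", "--select", "monthly_revenue"])
# .success is False if any model failed to build or any test failed.
if not (build.success and tests.success):
    raise SystemExit("dbt build or tests failed")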
Real-Time: Kafka
For batch, Airflow + dbt is enough. For real-time events (transactions, IoT data, clickstream), we use Apache Kafka. Kafka → Snowflake (Snowpipe) for near-real-time analytics; Kafka → custom consumers for real-time alerting.
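A minimal sketch of the alerting path using the kafka-python package; the topic name, broker address, and the trivial threshold rule are illustrative assumptions:

# alert_consumer.py -- hypothetical real-time alerting consumer
import json

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "transactions",                  # topic name is an assumption
    bootstrap_servers="kafka:9092",
    group_id="fraud-alerting",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    txn = message.value
    # A trivial rule as a stand-in for real alerting logic.
    if txn.get("amount", 0) > 100_000:
        print(f"ALERT: large transaction {txn.get('id')}: {txn['amount']}")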
Data Governance
GDPR in the Czech Republic is not optional. Data catalog (DataHub), column-level masking in Snowflake, audit logging of who queried what and when. For regulated industries (banking), it’s a prerequisite for operations.
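Column-level masking in Snowflake (an Enterprise Edition feature) is plain SQL; a sketch applying a masking policy to an email column via the Python connector, where the role, table, and policy names are placeholders:

# apply_masking.py -- hypothetical sketch: Snowflake column-level masking for PII
import os

import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)
cur = conn.cursor()
# Only privileged roles see the raw value; everyone else gets a masked string.
cur.execute("""
    CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
      CASE WHEN CURRENT_ROLE() IN ('PII_FULL_ACCESS') THEN val ELSE '***MASKED***' END
""")
cur.execute("ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask")
conn.close()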
Data as a Product
The modern data stack isn’t just about tools — it’s about the approach. Data as a product, version-controlled transformations, automated testing. If you’re still writing stored procedures without version control and tests, it’s time for an upgrade.