
Data Warehouse & Lakehouse

One place for all data. One source of truth.

We design and implement data warehouses and lakehouse architectures that consolidate data from dozens of sources into one reliable repository for reporting, analytics, and AI.

  • Query latency: <5s P95
  • MVP implementation: 6-10 weeks
  • Scalability: PB scale
  • Cost optimization: 30-60%

Why centralize data into warehouse/lakehouse

A typical enterprise has data scattered across dozens of systems — ERP, CRM, e-commerce, HR systems, Excel files, Google Sheets, third-party APIs. Each has its own formats, definitions, and history. The result:

  • Management can’t get answers to simple questions — “What was the revenue this month?” requires 3 days of analyst work
  • Numbers don’t match — sales reports different figures than finance, and nobody knows which is correct
  • Historical data is missing — source systems delete or overwrite, no audit trail
  • AI/ML has no data — models need consolidated, clean data in one place

Warehouse vs. Lakehouse vs. Lake

Data Warehouse (Snowflake, BigQuery, Redshift)

For whom: Companies with structured data whose primary need is BI and reporting.

  • Schema defined upfront (schema-on-write)
  • Optimized for SQL queries and aggregations
  • ACID transactions, time travel, zero-copy cloning
  • Managed service — no infrastructure to maintain
  • Highest performance for analytical queries

Data Lakehouse (Databricks, Delta Lake, Apache Iceberg)

For whom: Companies with a mix of structured and unstructured data and ML/AI workloads.

  • Open formats (Delta, Iceberg, Hudi) — no vendor lock-in
  • Schema-on-read and schema-on-write
  • Unified processing — SQL, Python, Spark, ML in one environment
  • Cost-effective storage (object storage = S3/ADLS)
  • ACID transactions over data lake thanks to Delta/Iceberg

Data Lake (S3/ADLS raw)

For whom: Landing zone for raw data, archival, specific ML pipelines.

  • Cheapest storage
  • No structure — dump anything
  • Without Delta/Iceberg = no ACID, no time travel
  • Typically Bronze layer in Medallion architecture
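
The Bronze → Silver → Gold flow of the Medallion architecture mentioned above can be sketched in a few lines of Python. The record shape (`order_id`, `amount`) is purely illustrative:

```python
# Minimal sketch of the Medallion flow: Bronze keeps raw records as-is,
# Silver cleans and validates, Gold aggregates into business-ready metrics.

def to_silver(bronze_rows):
    """Clean: drop rows missing keys, normalize types."""
    silver = []
    for row in bronze_rows:
        if row.get("order_id") is None:
            continue  # malformed record stays in Bronze only
        silver.append({
            "order_id": str(row["order_id"]),
            "amount": float(row.get("amount", 0)),
        })
    return silver

def to_gold(silver_rows):
    """Aggregate: a business-ready KPI (total revenue)."""
    return {"total_revenue": sum(r["amount"] for r in silver_rows)}

bronze = [
    {"order_id": 1, "amount": "99.5"},
    {"order_id": None, "amount": "10"},  # rejected at the Silver step
    {"order_id": 2, "amount": 50},
]
gold = to_gold(to_silver(bronze))  # {"total_revenue": 149.5}
```

The key property: raw data is never mutated; every downstream layer is reproducible from Bronze.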

How we choose technology

We don’t sell one technology. We choose based on your requirements:

Snowflake when: primary use case is BI/reporting, team knows SQL, you need multi-cloud, data sharing between organizations, separation of compute and storage is key.

Databricks when: you need ML/AI workloads alongside analytics, you have large volumes of unstructured data, team knows Python/Spark, you want open-source formats (Delta Lake).

BigQuery when: you’re on Google Cloud, you want serverless (no cluster management), pay-per-query model makes sense for your query patterns, you need GIS/ML integration.

PostgreSQL + dbt when: data volume < 100 GB, budget is limited, team knows PostgreSQL, you don’t need to scale compute independently from storage.
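
The selection heuristic above can be expressed as a simple rule chain. This is a hedged sketch — the thresholds and rule priorities are illustrative and no substitute for the discovery phase:

```python
# Illustrative encoding of the platform-selection rules described above.
# Rule order matters: cheapest adequate option first, then workload fit.

def pick_platform(volume_gb, needs_ml, on_gcp, team_skills):
    if volume_gb < 100:
        return "PostgreSQL + dbt"   # small volume, limited budget
    if needs_ml or "python" in team_skills:
        return "Databricks"         # ML/AI alongside analytics
    if on_gcp:
        return "BigQuery"           # serverless, pay-per-query
    return "Snowflake"              # BI/reporting, SQL-first team
```

For example, `pick_platform(50, False, False, {"sql"})` lands on PostgreSQL + dbt, while a 5 TB estate with ML workloads lands on Databricks.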

Implementation approach

1. Discovery and data modeling (2-3 weeks)

  • Inventory of sources and data entities
  • Dimensional modeling (Kimball) or Data Vault 2.0
  • Source of truth definition for key entities
  • Naming conventions, data types, standards
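
A Kimball-style star schema from the dimensional-modeling step might look like this in miniature; the entities, columns, and surrogate keys are hypothetical:

```python
# Star-schema fragment: one fact table keyed to conformed dimensions.
from dataclasses import dataclass

@dataclass
class DimCustomer:
    customer_key: int   # surrogate key, not the source-system ID
    source_id: str      # e.g. the CRM identifier, for lineage
    name: str

@dataclass
class FactSales:
    customer_key: int   # FK to DimCustomer
    date_key: int       # FK to a DimDate, encoded as YYYYMMDD
    amount: float

dim = {1: DimCustomer(1, "CRM-42", "Acme")}
facts = [FactSales(1, 20240115, 99.5), FactSales(1, 20240116, 50.0)]

# A typical Gold-layer rollup: revenue by customer.
revenue_by_customer = {}
for f in facts:
    name = dim[f.customer_key].name
    revenue_by_customer[name] = revenue_by_customer.get(name, 0) + f.amount
```

Surrogate keys decouple the warehouse from source-system IDs, which is what makes a single source of truth across ERP and CRM possible.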

2. Infrastructure and ingestion (2-3 weeks)

  • Provisioning warehouse/lakehouse (IaC — Terraform)
  • Ingestion pipeline for key sources
  • Bronze layer — raw data, immutable, partitioned
  • Monitoring and alerting from day one
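
An idempotent Bronze-layer write — raw, immutable, partitioned by load date — can be sketched as follows. The in-memory dict stands in for object storage (S3/ADLS) and the path convention is an assumption:

```python
# Sketch of an idempotent Bronze ingestion: the raw payload is stored
# unchanged, partitioned by load date, and keyed by a content hash so
# that pipeline re-runs do not create duplicates.
import hashlib
import json

bronze_store = {}  # path -> raw payload (stand-in for S3/ADLS)

def ingest_bronze(source, payload, load_date):
    raw = json.dumps(payload, sort_keys=True)  # deterministic serialization
    key = hashlib.sha256(raw.encode()).hexdigest()[:12]
    path = f"bronze/{source}/dt={load_date}/{key}.json"
    bronze_store.setdefault(path, raw)  # immutable: never overwrite
    return path

p1 = ingest_bronze("crm", {"id": 1, "name": "Acme"}, "2024-01-15")
p2 = ingest_bronze("crm", {"id": 1, "name": "Acme"}, "2024-01-15")  # re-run
```

Re-running the pipeline yields the same path and writes nothing new — the property that makes backfills and retries safe.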

3. Transformation and business layer (3-4 weeks)

  • dbt project setup with CI/CD
  • Silver layer — cleaning, validation, conforming
  • Gold layer — business-ready views, KPIs, metrics
  • Semantic layer for consistent definitions
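
The Silver-layer validation mirrors what dbt's generic tests (`not_null`, `unique`, `accepted_values`) check; here is the same idea sketched in plain Python with illustrative column names:

```python
# dbt-style data tests in plain Python: every failure is collected rather
# than raised, like a dbt test run reporting all failing rows at once.

def run_tests(rows, unique_key, not_null_cols, accepted=None):
    failures = []
    seen = set()
    for i, row in enumerate(rows):
        for col in not_null_cols:
            if row.get(col) is None:
                failures.append(f"row {i}: {col} is null")
        k = row.get(unique_key)
        if k in seen:
            failures.append(f"row {i}: duplicate {unique_key}={k}")
        seen.add(k)
        for col, values in (accepted or {}).items():
            if row.get(col) not in values:
                failures.append(f"row {i}: bad {col}={row.get(col)!r}")
    return failures

rows = [
    {"id": 1, "status": "paid"},
    {"id": 1, "status": "refunded"},  # duplicate key
    {"id": 2, "status": "unknown"},   # value outside the accepted set
]
errs = run_tests(rows, "id", ["id"], {"status": {"paid", "refunded"}})
```

In the real project these checks live as YAML in the dbt project and run in CI/CD before anything reaches the Gold layer.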

4. Optimization and hardening (ongoing)

  • Query performance tuning (clustering, materialized views)
  • Cost optimization (warehouse sizing, auto-suspend, resource monitors)
  • Partitioning and pruning strategies
  • Backup, DR, retention policies

Cost optimization

A cloud warehouse without governance quickly generates unexpected costs. We implement:

  • Resource monitors — automatic shutdown when budget limit is reached
  • Auto-suspend/resume — the warehouse doesn’t run when nobody is using it
  • Query profiling — identification of expensive queries, optimization
  • Storage tiering — hot/warm/cold data on different storage levels
  • Reservation vs. on-demand — for predictable workloads, reserved capacity saves 30-60%
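
A back-of-the-envelope check on the reserved-vs-on-demand comparison above; the hourly rates and utilization are illustrative assumptions, not vendor pricing:

```python
# Reserved capacity bills every hour of the month at a discounted rate;
# on-demand bills only hours actually used at the full rate. Whether the
# reservation pays off depends on utilization.

HOURS_PER_MONTH = 730

def monthly_cost(hours_used, on_demand_rate, reserved_rate=None):
    if reserved_rate is None:
        return hours_used * on_demand_rate        # on-demand
    return HOURS_PER_MONTH * reserved_rate        # reserved: full month

on_demand = monthly_cost(600, 4.0)        # 600 busy hours at $4/h -> $2400
reserved = monthly_cost(600, 4.0, 2.0)    # ~50% discounted rate -> $1460
savings = 1 - reserved / on_demand        # ~39%, inside the 30-60% range
```

At low utilization the inequality flips, which is why the comparison is only made for predictable workloads.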

Frequently asked questions

When should we choose a warehouse vs. a lakehouse?

Warehouse (Snowflake, BigQuery) is ideal for structured data and BI/reporting. Lakehouse (Databricks, Delta Lake) combines data lake flexibility with warehouse reliability — suitable when you have a mix of structured and unstructured data, or need ML workloads.

How much does a cloud data warehouse cost?

Depends on volume and query patterns. Snowflake: from $2-5K/month for smaller companies, $20-100K+ for enterprise. BigQuery: the pay-per-query model can be cheaper for sporadic queries. We always design with cost monitoring and optimization from day one.

Can you migrate our existing on-premise warehouse?

Yes. We migrate from Oracle, SQL Server, and Teradata to cloud solutions. Process: schema mapping, data migration, query translation, parallel run, cutover. Typically 2-4 months depending on complexity.

How do you separate data between tenants or departments?

Depends on requirements: logical separation (row-level security, schemas) for cost efficiency, or physical separation (dedicated warehouse/cluster) for regulated sectors. Usually logical separation with RBAC is sufficient.
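
The logical-separation option — row-level filtering plus RBAC — can be sketched as follows; the roles, tenants, and rows are illustrative:

```python
# Row-level security sketch: every query is filtered to the tenants the
# caller's role is entitled to see. Unknown roles see nothing.

rows = [
    {"tenant": "acme", "revenue": 100},
    {"tenant": "globex", "revenue": 200},
]

role_tenants = {
    "acme_analyst": {"acme"},
    "admin": {"acme", "globex"},
}

def query(role, table):
    allowed = role_tenants.get(role, set())  # default-deny
    return [r for r in table if r["tenant"] in allowed]
```

Warehouses implement the same idea natively (e.g. row access policies), so the filter is enforced in the platform rather than in application code.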

Do you have a project?

Let's talk about it.

Schedule a meeting