
Data Warehouse & Lakehouse

One place for all data. One source of truth.

We design and implement data warehouses and lakehouse architectures that consolidate data from dozens of sources into one reliable repository for reporting, analytics, and AI.

  • Query latency: <5s P95
  • MVP implementation: 6-10 weeks
  • Scalability: PB scale
  • Cost optimization: 30-60%

Why centralize data into warehouse/lakehouse

A typical enterprise has data scattered across dozens of systems — ERP, CRM, e-commerce, HR systems, Excel files, Google Sheets, third-party APIs. Each has its own formats, definitions, and history. The result:

  • Management can’t get answers to simple questions — “What was the revenue this month?” requires 3 days of analyst work
  • Numbers don’t match — sales reports different figures than finance, and nobody knows which is correct
  • Historical data is missing — source systems delete or overwrite, no audit trail
  • AI/ML has no data — models need consolidated, clean data in one place

Warehouse vs. Lakehouse vs. Lake

Data Warehouse (Snowflake, BigQuery, Redshift)

For whom: Companies with structured data whose primary need is BI and reporting.

  • Schema defined upfront (schema-on-write)
  • Optimized for SQL queries and aggregations
  • ACID transactions, time travel, zero-copy cloning
  • Managed service — no infrastructure to maintain
  • Highest performance for analytical queries

Data Lakehouse (Databricks, Delta Lake, Apache Iceberg)

For whom: Companies with a mix of structured and unstructured data and ML/AI workloads.

  • Open formats (Delta, Iceberg, Hudi) — no vendor lock-in
  • Schema-on-read and schema-on-write
  • Unified processing — SQL, Python, Spark, ML in one environment
  • Cost-effective storage (object storage = S3/ADLS)
  • ACID transactions over data lake thanks to Delta/Iceberg

Data Lake (S3/ADLS raw)

For whom: Landing zone for raw data, archival, specific ML pipelines.

  • Cheapest storage
  • No structure — dump anything
  • Without Delta/Iceberg = no ACID, no time travel
  • Typically Bronze layer in Medallion architecture
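
The Bronze → Silver → Gold flow of the Medallion architecture mentioned above can be sketched in a few lines of Python. The record shape (`order_id`, `amount`) is purely illustrative:

```python
# Minimal sketch of the Medallion flow: Bronze keeps raw records as-is,
# Silver cleans and validates, Gold aggregates into business-ready metrics.

def to_silver(bronze_rows):
    """Clean: drop rows missing keys, normalize types."""
    silver = []
    for row in bronze_rows:
        if row.get("order_id") is None:
            continue  # malformed record stays in Bronze only
        silver.append({
            "order_id": str(row["order_id"]),
            "amount": float(row.get("amount", 0)),
        })
    return silver

def to_gold(silver_rows):
    """Aggregate: a business-ready KPI (total revenue)."""
    return {"total_revenue": sum(r["amount"] for r in silver_rows)}

bronze = [
    {"order_id": 1, "amount": "99.5"},
    {"order_id": None, "amount": "10"},  # rejected at the Silver step
    {"order_id": 2, "amount": 50},
]
gold = to_gold(to_silver(bronze))  # {"total_revenue": 149.5}
```

The key property: raw data is never mutated; every downstream layer is reproducible from Bronze.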

How we choose technology

We don’t sell one technology. We choose based on your requirements:

Snowflake when: primary use case is BI/reporting, team knows SQL, you need multi-cloud, data sharing between organizations, separation of compute and storage is key.

Databricks when: you need ML/AI workloads alongside analytics, you have large volumes of unstructured data, team knows Python/Spark, you want open-source formats (Delta Lake).

BigQuery when: you’re on Google Cloud, you want serverless (no cluster management), pay-per-query model makes sense for your query patterns, you need GIS/ML integration.

PostgreSQL + dbt when: data volume < 100 GB, budget is limited, team knows PostgreSQL, you don’t need to scale compute independently from storage.
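
The selection heuristic above can be expressed as a simple rule chain. This is a hedged sketch — the thresholds and rule priorities are illustrative and no substitute for the discovery phase:

```python
# Illustrative encoding of the platform-selection rules described above.
# Rule order matters: cheapest adequate option first, then workload fit.

def pick_platform(volume_gb, needs_ml, on_gcp, team_skills):
    if volume_gb < 100:
        return "PostgreSQL + dbt"   # small volume, limited budget
    if needs_ml or "python" in team_skills:
        return "Databricks"         # ML/AI alongside analytics
    if on_gcp:
        return "BigQuery"           # serverless, pay-per-query
    return "Snowflake"              # BI/reporting, SQL-first team
```

For example, `pick_platform(50, False, False, {"sql"})` lands on PostgreSQL + dbt, while a 5 TB estate with ML workloads lands on Databricks.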

Implementation approach

1. Discovery and data modeling (2-3 weeks)

  • Inventory of sources and data entities
  • Dimensional modeling (Kimball) or Data Vault 2.0
  • Source of truth definition for key entities
  • Naming conventions, data types, standards
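
A Kimball-style star schema from the dimensional-modeling step might look like this in miniature; the entities, columns, and surrogate keys are hypothetical:

```python
# Star-schema fragment: one fact table keyed to conformed dimensions.
from dataclasses import dataclass

@dataclass
class DimCustomer:
    customer_key: int   # surrogate key, not the source-system ID
    source_id: str      # e.g. the CRM identifier, for lineage
    name: str

@dataclass
class FactSales:
    customer_key: int   # FK to DimCustomer
    date_key: int       # FK to a DimDate, encoded as YYYYMMDD
    amount: float

dim = {1: DimCustomer(1, "CRM-42", "Acme")}
facts = [FactSales(1, 20240115, 99.5), FactSales(1, 20240116, 50.0)]

# A typical Gold-layer rollup: revenue by customer.
revenue_by_customer = {}
for f in facts:
    name = dim[f.customer_key].name
    revenue_by_customer[name] = revenue_by_customer.get(name, 0) + f.amount
```

Surrogate keys decouple the warehouse from source-system IDs, which is what makes a single source of truth across ERP and CRM possible.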

2. Infrastructure and ingestion (2-3 weeks)

  • Provisioning warehouse/lakehouse (IaC — Terraform)
  • Ingestion pipeline for key sources
  • Bronze layer — raw data, immutable, partitioned
  • Monitoring and alerting from day one
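
An idempotent Bronze-layer write — raw, immutable, partitioned by load date — can be sketched as follows. The in-memory dict stands in for object storage (S3/ADLS) and the path convention is an assumption:

```python
# Sketch of an idempotent Bronze ingestion: the raw payload is stored
# unchanged, partitioned by load date, and keyed by a content hash so
# that pipeline re-runs do not create duplicates.
import hashlib
import json

bronze_store = {}  # path -> raw payload (stand-in for S3/ADLS)

def ingest_bronze(source, payload, load_date):
    raw = json.dumps(payload, sort_keys=True)  # deterministic serialization
    key = hashlib.sha256(raw.encode()).hexdigest()[:12]
    path = f"bronze/{source}/dt={load_date}/{key}.json"
    bronze_store.setdefault(path, raw)  # immutable: never overwrite
    return path

p1 = ingest_bronze("crm", {"id": 1, "name": "Acme"}, "2024-01-15")
p2 = ingest_bronze("crm", {"id": 1, "name": "Acme"}, "2024-01-15")  # re-run
```

Re-running the pipeline yields the same path and writes nothing new — the property that makes backfills and retries safe.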

3. Transformation and business layer (3-4 weeks)

  • dbt project setup with CI/CD
  • Silver layer — cleaning, validation, conforming
  • Gold layer — business-ready views, KPIs, metrics
  • Semantic layer for consistent definitions
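
The Silver-layer validation mirrors what dbt's generic tests (`not_null`, `unique`, `accepted_values`) check; here is the same idea sketched in plain Python with illustrative column names:

```python
# dbt-style data tests in plain Python: every failure is collected rather
# than raised, like a dbt test run reporting all failing rows at once.

def run_tests(rows, unique_key, not_null_cols, accepted=None):
    failures = []
    seen = set()
    for i, row in enumerate(rows):
        for col in not_null_cols:
            if row.get(col) is None:
                failures.append(f"row {i}: {col} is null")
        k = row.get(unique_key)
        if k in seen:
            failures.append(f"row {i}: duplicate {unique_key}={k}")
        seen.add(k)
        for col, values in (accepted or {}).items():
            if row.get(col) not in values:
                failures.append(f"row {i}: bad {col}={row.get(col)!r}")
    return failures

rows = [
    {"id": 1, "status": "paid"},
    {"id": 1, "status": "refunded"},  # duplicate key
    {"id": 2, "status": "unknown"},   # value outside the accepted set
]
errs = run_tests(rows, "id", ["id"], {"status": {"paid", "refunded"}})
```

In the real project these checks live as YAML in the dbt project and run in CI/CD before anything reaches the Gold layer.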

4. Optimization and hardening (ongoing)

  • Query performance tuning (clustering, materialized views)
  • Cost optimization (warehouse sizing, auto-suspend, resource monitors)
  • Partitioning and pruning strategies
  • Backup, DR, retention policies

Cost optimization

A cloud warehouse without governance quickly generates unexpected costs. We implement:

  • Resource monitors — automatic shutdown when budget limit is reached
  • Auto-suspend/resume — the warehouse doesn’t run when nobody is using it
  • Query profiling — identification of expensive queries, optimization
  • Storage tiering — hot/warm/cold data on different storage levels
  • Reservation vs. on-demand — for predictable workloads, reserved capacity saves 30-60%
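
A back-of-the-envelope check on the reserved-vs-on-demand comparison above; the hourly rates and utilization are illustrative assumptions, not vendor pricing:

```python
# Reserved capacity bills every hour of the month at a discounted rate;
# on-demand bills only hours actually used at the full rate. Whether the
# reservation pays off depends on utilization.

HOURS_PER_MONTH = 730

def monthly_cost(hours_used, on_demand_rate, reserved_rate=None):
    if reserved_rate is None:
        return hours_used * on_demand_rate        # on-demand
    return HOURS_PER_MONTH * reserved_rate        # reserved: full month

on_demand = monthly_cost(600, 4.0)        # 600 busy hours at $4/h -> $2400
reserved = monthly_cost(600, 4.0, 2.0)    # ~50% discounted rate -> $1460
savings = 1 - reserved / on_demand        # ~39%, inside the 30-60% range
```

At low utilization the inequality flips, which is why the comparison is only made for predictable workloads.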

Frequently asked questions

When should we choose a warehouse vs. a lakehouse?

Warehouse (Snowflake, BigQuery) is ideal for structured data and BI/reporting. Lakehouse (Databricks, Delta Lake) combines data lake flexibility with warehouse reliability — suitable when you have a mix of structured and unstructured data, or need ML workloads.

How much does a cloud data warehouse cost?

Depends on volume and query patterns. Snowflake: from $2-5K/month for smaller companies, $20-100K+ for enterprise. BigQuery: the pay-per-query model can be cheaper for sporadic queries. We always design with cost monitoring and optimization from day one.

Can you migrate our existing on-premise warehouse?

Yes. We migrate from Oracle, SQL Server, and Teradata to cloud solutions. Process: schema mapping, data migration, query translation, parallel run, cutover. Typically 2-4 months depending on complexity.

How do you separate data between tenants or departments?

Depends on requirements: logical separation (row-level security, schemas) for cost efficiency, or physical separation (dedicated warehouse/cluster) for regulated sectors. Usually logical separation with RBAC is sufficient.
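
The logical-separation option — row-level filtering plus RBAC — can be sketched as follows; the roles, tenants, and rows are illustrative:

```python
# Row-level security sketch: every query is filtered to the tenants the
# caller's role is entitled to see. Unknown roles see nothing.

rows = [
    {"tenant": "acme", "revenue": 100},
    {"tenant": "globex", "revenue": 200},
]

role_tenants = {
    "acme_analyst": {"acme"},
    "admin": {"acme", "globex"},
}

def query(role, table):
    allowed = role_tenants.get(role, set())  # default-deny
    return [r for r in table if r["tenant"] in allowed]
```

Warehouses implement the same idea natively (e.g. row access policies), so the filter is enforced in the platform rather than in application code.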

Do you have a project?

Let's talk about it.

Schedule a meeting