
Data Lake — Architecture for Storing Raw Data

25. 08. 2025 1 min read intermediate

A data lake is a central repository for raw data in any format. Everything from structured tables to unstructured logs lives on inexpensive object storage.

What Is a Data Lake

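A data lake stores data in its raw form and applies a schema only when the data is read (schema-on-read), instead of enforcing one when the data is written.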
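To make schema-on-read concrete, here is a minimal sketch using only the Python standard library. The sample events, the `ORDER_SCHEMA` mapping, and the `read_with_schema` helper are hypothetical names made up for illustration; a real lake would typically do this through an engine such as Spark, Trino, or DuckDB.

```python
import io
import json
from datetime import datetime

# Raw events exactly as they were landed: nothing was validated at write time.
# (Sample data invented for this sketch.)
RAW_EVENTS = io.StringIO(
    '{"order_id": "1001", "amount": "49.90", "ts": "2025-08-25T10:15:00"}\n'
    '{"order_id": "1002", "amount": "15", "ts": "2025-08-25T11:00:00", "coupon": "SUMMER"}\n'
)

# Hypothetical schema chosen by the reader: field name -> casting function.
ORDER_SCHEMA = {
    "order_id": int,
    "amount": float,
    "ts": datetime.fromisoformat,
}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and cast only the fields named in the schema."""
    for line in lines:
        record = json.loads(line)
        # Fields outside the schema (e.g. "coupon") stay in the raw file
        # but are ignored by this particular reader.
        yield {name: cast(record[name]) for name, cast in schema.items()}


orders = list(read_with_schema(RAW_EVENTS, ORDER_SCHEMA))
total = sum(order["amount"] for order in orders)
```

The raw file is never rewritten: a different consumer could read the same lines with another schema, for example one that includes `coupon`.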

Architecture

  • Storage — S3, GCS, ADLS
  • Formats — Parquet, Avro, JSON
  • Catalog — Glue, Hive Metastore
  • Compute — Spark, Trino, DuckDB

A typical bucket layout with bronze, silver, and gold zones:

s3://data-lake/
├── raw/           # Bronze
│   ├── orders/
│   └── events/
├── processed/     # Silver
│   └── orders/
├── curated/       # Gold
│   └── daily_revenue/
└── _metadata/
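
One way to keep the bronze/silver/gold layout above consistent is to build object keys in a single place. The sketch below is an illustration only: the `zone_key` and `s3_uri` helpers and the `dt=YYYY-MM-DD` partition folder are assumptions, not part of the original layout.

```python
from datetime import date
from pathlib import PurePosixPath

BUCKET = "data-lake"  # bucket name taken from the layout above

# Layer name -> top-level prefix, matching the tree above.
ZONES = {"bronze": "raw", "silver": "processed", "gold": "curated"}


def zone_key(layer: str, dataset: str, day: date, filename: str) -> str:
    """Build an object key for one file of a dataset in the given layer."""
    if layer not in ZONES:
        raise ValueError(f"unknown layer: {layer!r}")
    # Assumption: one Hive-style date partition folder per day.
    key = PurePosixPath(ZONES[layer]) / dataset / f"dt={day.isoformat()}" / filename
    return str(key)


def s3_uri(key: str) -> str:
    """Turn an object key into a full s3:// URI for the bucket."""
    return f"s3://{BUCKET}/{key}"


uri = s3_uri(zone_key("bronze", "orders", date(2025, 8, 25), "part-0000.parquet"))
print(uri)
```

Centralizing key construction makes it harder for a job to write outside the agreed zones, which is one small step toward governance.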

What to Avoid (Data Swamp)

  • Missing catalog
  • No governance
  • Small files: thousands of 1 KB files that slow down listing and queries
  • Missing lineage
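
The small-files problem from the list above is usually handled by compaction: merging many tiny files into fewer large ones. The following is a minimal local-filesystem sketch; the `compact_small_files` function, the `.jsonl` extension, and the 1 KB threshold are assumptions for illustration. On real object storage this job would normally be done by the compute engine or the table format's own maintenance routines.

```python
import tempfile
from pathlib import Path


def compact_small_files(source_dir: Path, target_file: Path, max_bytes: int = 1024) -> int:
    """Merge every *.jsonl file of at most max_bytes into target_file.

    Returns the number of files merged. Merged source files are deleted.
    target_file must live outside source_dir.
    """
    merged = 0
    with target_file.open("wb") as out:
        # sorted() materializes the listing before any file is deleted.
        for path in sorted(source_dir.glob("*.jsonl")):
            if path.stat().st_size <= max_bytes:
                data = path.read_bytes()
                # Keep one record per line even if a file lacks a trailing newline.
                out.write(data if data.endswith(b"\n") else data + b"\n")
                path.unlink()
                merged += 1
    return merged


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw" / "events"
    raw.mkdir(parents=True)
    # Simulate the anti-pattern: 1000 tiny files with one event each.
    for i in range(1000):
        (raw / f"event-{i:04d}.jsonl").write_text(f'{{"event_id": {i}}}\n')

    compacted = Path(tmp) / "compacted.jsonl"
    merged_count = compact_small_files(raw, compacted)
    line_count = len(compacted.read_text().splitlines())
    remaining = len(list(raw.glob("*.jsonl")))
```

After the run, the 1000 single-event files are replaced by one file with 1000 lines, so a query engine opens one object instead of a thousand.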

Summary

A data lake combined with table formats (such as Apache Iceberg, Delta Lake, or Apache Hudi) and governance becomes a lakehouse: a reliable foundation for analytics.


CORE SYSTEMS team

We build core systems and AI agents that keep operations running. 15 years of experience in enterprise IT.