A data lake is a central repository for raw data in any format. It holds everything from structured tables to unstructured logs on inexpensive object storage.
What Is a Data Lake
A data lake stores data in its raw form and relies on schema-on-read: structure is applied when the data is queried, not when it is written.
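To make schema-on-read concrete, here is a minimal sketch using only the Python standard library. The sample records, the `ORDER_SCHEMA` mapping, and the `read_with_schema` helper are all hypothetical names invented for illustration; real engines such as Spark or Trino do this at far larger scale and with their own APIs.

```python
import io
import json

# Raw events are kept exactly as they arrived: one JSON object per line,
# with no schema enforced at write time (hypothetical sample data).
RAW_EVENTS = io.StringIO(
    '{"order_id": "1", "amount": "19.90", "country": "CZ"}\n'
    '{"order_id": "2", "amount": "5.00"}\n'
    '{"order_id": "3", "amount": "12.50", "coupon": "SPRING"}\n'
)

# The schema lives with the reader: field name -> (type converter, default).
ORDER_SCHEMA = {
    "order_id": (int, None),
    "amount": (float, 0.0),
    "country": (str, "unknown"),
}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and project each record onto the reader's schema."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, (convert, default) in schema.items():
            # Missing fields get a default; unknown fields are simply ignored.
            row[field] = convert(record[field]) if field in record else default
        rows.append(row)
    return rows


orders = read_with_schema(RAW_EVENTS, ORDER_SCHEMA)
```

The raw lines never change. A different consumer could read the same data with a different schema, for example one that also picks up the `coupon` field.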
Architecture
- Storage: object stores such as Amazon S3, Google Cloud Storage (GCS), or Azure Data Lake Storage (ADLS)
- Formats: Parquet, Avro, JSON
- Catalog: AWS Glue, Hive Metastore
- Compute: Spark, Trino, DuckDB
A typical bucket layout separates the Bronze, Silver, and Gold layers:
s3://data-lake/
├── raw/ # Bronze
│ ├── orders/
│ └── events/
├── processed/ # Silver
│ └── orders/
├── curated/ # Gold
│ └── daily_revenue/
└── _metadata/
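The layout above can be encoded in a small path helper so that every job writes to the same place. This is only a sketch: the `dataset_path` function and the Hive-style `dt=YYYY-MM-DD` partition folder are assumptions added for illustration and are not part of the layout shown above.

```python
from datetime import date

BUCKET = "s3://data-lake"

# Layer names mapped to the top-level prefixes from the layout above.
LAYERS = {"bronze": "raw", "silver": "processed", "gold": "curated"}


def dataset_path(layer, dataset, day=None):
    """Build the object-storage prefix for a dataset in a given layer.

    The optional `day` adds a Hive-style date partition (an assumed
    convention, not something the layout above prescribes).
    """
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    path = f"{BUCKET}/{LAYERS[layer]}/{dataset}/"
    if day is not None:
        path += f"dt={day.isoformat()}/"
    return path


raw_orders = dataset_path("bronze", "orders")
gold_revenue = dataset_path("gold", "daily_revenue", date(2024, 1, 15))
```

Keeping the mapping in one place means a renamed prefix is changed once rather than in every pipeline.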
What to Avoid (Data Swamp)
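- Missing catalog: nobody can find out which datasets exist
- No governance: no ownership, access control, or quality rules
- Small files: thousands of 1 KB files that slow down listing and queries
- Missing lineage: no record of where a dataset came from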
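The small-files problem is usually handled by compaction, which rewrites many tiny files into fewer large ones. The sketch below only plans the batches and performs no I/O. The `plan_compaction` helper, the file names, and the 1 MiB target are hypothetical choices made for illustration; production targets are typically much larger.

```python
TARGET_BYTES = 1024 * 1024  # assumed target size per compacted file (1 MiB)


def plan_compaction(files, target_bytes=TARGET_BYTES):
    """Group (path, size) pairs into batches of at most target_bytes each.

    A single file larger than the target ends up in a batch of its own.
    """
    batches = []
    current, current_size = [], 0
    for path, size in sorted(files, key=lambda f: f[1]):
        # Close the current batch before it would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical listing: 5000 files of 1 KB each, as in the anti-pattern above.
small_files = [(f"raw/events/part-{i:05d}.json", 1024) for i in range(5000)]
batches = plan_compaction(small_files)
```

Each batch would then be read and rewritten as one file by whatever compute engine the lake uses.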
Summary
A data lake combined with table formats and governance becomes a lakehouse: a reliable foundation for analytics.