# Delta Lake — ACID Transactions for Data Lakes

Delta Lake is an open-source storage layer that brings reliability to data lakes. It adds ACID transactions, schema enforcement, and time travel on top of Parquet files.
## Why Delta Lake
Traditional data lakes suffer from inconsistent reads during concurrent writes and from the lack of schema enforcement. Delta Lake solves both with a transaction log (the `_delta_log` directory) that records every change to the table as an atomic commit.
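The log layout can be sketched as follows. Per Delta's log protocol, each commit is a JSON file named by its version number, zero-padded to 20 digits; the range of versions here is purely illustrative.

```python
# Each commit to a Delta table appends one ordered JSON file to _delta_log/;
# readers replay these files to reconstruct a consistent table state.
commit_files = [f"_delta_log/{version:020d}.json" for version in range(3)]
print(commit_files)
# ['_delta_log/00000000000000000000.json',
#  '_delta_log/00000000000000000001.json',
#  '_delta_log/00000000000000000002.json']
```

Because a commit either fully appears in the log or not at all, concurrent readers never see a half-written table.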
## Key Features
- ACID transactions
- Schema enforcement/evolution
- Time travel
- MERGE (upsert)
```python
from delta import DeltaTable

# Write a DataFrame (`df`) as a Delta table
df.write.format("delta").save("/data/orders")

# Time travel: read the table as it was at version 5
spark.read.format("delta").option("versionAsOf", 5).load("/data/orders")

# MERGE (upsert): `updates` is a DataFrame of new and changed rows
dt = DeltaTable.forPath(spark, "/data/orders")
(dt.alias("t")
   .merge(updates.alias("s"), "t.order_id = s.order_id")
   .whenMatchedUpdateAll()
   .whenNotMatchedInsertAll()
   .execute())
```
## Optimization and Maintenance

```sql
-- Compact small files and co-locate data by customer_id
OPTIMIZE delta.`/data/orders` ZORDER BY (customer_id);

-- Delete unreferenced files older than 168 hours (7 days)
VACUUM delta.`/data/orders` RETAIN 168 HOURS;
```
The OPTIMIZE command compacts small files into larger ones, speeding up reads. ZORDER rearranges data by the specified columns for more effective data skipping: if you frequently filter by customer_id, Z-ordering on that column significantly reduces the amount of data scanned. VACUUM deletes file versions that are no longer referenced by the table and fall outside the retention window; note that this also limits how far back time travel can reach.
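The retention arithmetic is simple and worth making explicit. The sketch below uses a fixed "current" timestamp for illustration; 168 hours corresponds to the 7-day default retention window.

```python
from datetime import datetime, timedelta

# RETAIN 168 HOURS == 7 days, the default VACUUM retention window
retention = timedelta(hours=168)

now = datetime(2024, 1, 8)  # illustrative "current" timestamp
cutoff = now - retention

# Unreferenced files older than `cutoff` are eligible for deletion,
# so time travel to versions written before it is no longer guaranteed.
print(cutoff)  # 2024-01-01 00:00:00
```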
In practice, Delta Lake is usually paired with Apache Spark for both batch and streaming processing, with Unity Catalog (on Databricks) or the Hive Metastore (HMS) serving as the metadata catalog. Delta Lake supports schema evolution, so a column can be added without rewriting existing data. To migrate from raw Parquet, convert the existing files in place with the CONVERT TO DELTA command.
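Why adding a column needs no rewrite can be modeled in plain Python: the table schema lives in the transaction log as metadata, so evolution only merges the new column definition, while existing Parquet files stay untouched. The schemas below are illustrative, not Delta's internal representation.

```python
# Schema evolution as a metadata merge: the evolved schema is the union of
# the existing table schema and the incoming write's schema.
table_schema = {"order_id": "bigint", "amount": "double"}
incoming_schema = {"order_id": "bigint", "amount": "double", "channel": "string"}

evolved = {**table_schema, **incoming_schema}
print(evolved)
# {'order_id': 'bigint', 'amount': 'double', 'channel': 'string'}
```

Readers treat the new column as null for old files, which is why no data rewrite is required.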
## Summary
Delta Lake brings warehouse-grade reliability to the data lake, which is why it serves as the foundation of the lakehouse architecture.