Delta Lake is an open-source storage layer that brings reliability to data lakes: ACID transactions, schema enforcement, and time travel on top of Parquet files.
Why Delta Lake¶
Plain data lakes suffer from inconsistent reads under concurrent writes and from silent schema drift. Delta Lake fixes both by recording every commit in a transaction log (the `_delta_log` directory) stored alongside the Parquet data files.
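The log is also queryable. A minimal sketch, assuming a `SparkSession` named `spark` configured with Delta support and an existing Delta table at the placeholder path `/data/orders`:

```python
from delta import DeltaTable

# Each committed transaction becomes a versioned entry in _delta_log;
# history() surfaces those entries as a DataFrame.
dt = DeltaTable.forPath(spark, "/data/orders")
dt.history().select("version", "timestamp", "operation").show()
```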
Key Features¶
- ACID transactions
- Schema enforcement/evolution
- Time travel
- MERGE (upsert)
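The snippet below shows the first, third, and fourth features in code. It assumes an active `SparkSession` named `spark` with Delta enabled, an orders DataFrame `df`, and an updates DataFrame `new`; those names are placeholders, not part of the Delta API.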
```python
from delta import DeltaTable

# Write a DataFrame out as a Delta table (df assumed to exist upstream)
df.write.format("delta").save("/data/orders")

# Time travel: read the table as it was at version 5
spark.read.format("delta").option("versionAsOf", 5).load("/data/orders")

# MERGE (upsert): update matching orders, insert the rest
dt = DeltaTable.forPath(spark, "/data/orders")
(dt.alias("t")
   .merge(new.alias("s"), "t.order_id = s.order_id")
   .whenMatchedUpdateAll()
   .whenNotMatchedInsertAll()
   .execute())
```
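Schema enforcement and evolution deserve a note of their own: enforcement rejects writes whose columns do not match the table, while evolution lets you opt in to new columns. A sketch, assuming a hypothetical DataFrame `df_extra` that carries one column the table lacks:

```python
# Enforcement: this append would fail with an AnalysisException because
# df_extra's schema does not match the table schema at /data/orders.
# df_extra.write.format("delta").mode("append").save("/data/orders")

# Evolution: opt in to merging the new column into the table schema.
df_extra.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("/data/orders")
```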
Two maintenance commands keep a Delta table healthy:

```sql
-- Compact small files and co-locate rows by customer_id for faster scans
OPTIMIZE delta.`/data/orders` ZORDER BY (customer_id)

-- Delete unreferenced files older than 7 days (168 hours)
VACUUM delta.`/data/orders` RETAIN 168 HOURS
```
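Both are Spark SQL statements, so from PySpark they can be issued through `spark.sql`. A sketch, assuming the same `spark` session as above:

```python
# OPTIMIZE and VACUUM are Spark SQL; run them from Python via spark.sql
spark.sql("OPTIMIZE delta.`/data/orders` ZORDER BY (customer_id)")
spark.sql("VACUUM delta.`/data/orders` RETAIN 168 HOURS")
```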
Summary¶
Delta Lake adds data-warehouse reliability, such as transactions, enforced schemas, and versioned history, to an ordinary data lake, which is why it serves as the foundation of the lakehouse architecture.