A sound partitioning strategy dramatically impacts query performance: use time-based partitioning for time-series data, hash partitioning for even distribution, and range partitioning for sequential data.
Why Partitioning
Without partitioning, the engine scans the entire table on every query. Partitioning lets the engine skip data it does not need (partition pruning): a filter on a partition column eliminates whole partitions from the scan.
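For concreteness, here is a minimal, self-contained sketch of the layout that pruning exploits. The path /tmp/orders_demo and the toy data are hypothetical, and plain Parquet is used instead of Delta so the snippet runs with stock PySpark.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Two rows that land in two different partitions.
orders = spark.createDataFrame(
    [("2026-01-15", 100), ("2026-02-03", 250)],
    ["order_date", "amount"],
).withColumn("month", F.month(F.to_date("order_date")))

orders.write.mode("overwrite").partitionBy("month").parquet("/tmp/orders_demo")
# On disk: /tmp/orders_demo/month=1/... and /tmp/orders_demo/month=2/...
# A query filtering on month = 2 only ever opens the second directory.
```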
Partitioning Types
- Time-based — the most common; partition by date (day, month)
- Hash — even distribution via a hash of the key
- Range — contiguous value ranges (A-M, N-Z)
- List — an explicit list of values such as regions or categories (the three non-time strategies are sketched in Spark right after this list)
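One way to sketch how hash, range, and list partitioning map onto on-disk partition directories in Spark; the source path and the column names user_id, last_name, and region are hypothetical, chosen only for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/customers")  # hypothetical source table

# Hash: derive a bucket from a high-cardinality key for even distribution.
df.withColumn("bucket", F.abs(F.hash("user_id")) % 16) \
  .write.partitionBy("bucket").parquet("/data/customers_by_bucket")

# Range: map values into contiguous ranges (A-M / N-Z).
df.withColumn(
    "name_range",
    F.when(F.upper(F.substring("last_name", 1, 1)) <= "M", "A-M").otherwise("N-Z"),
).write.partitionBy("name_range").parquet("/data/customers_by_range")

# List: partition directly by an explicit categorical column.
df.write.partitionBy("region").parquet("/data/customers_by_region")
```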
```python
# Spark: partitioning during write
df.write.format("delta") \
    .partitionBy("year", "month") \
    .save("/data/orders")

# Query with partition pruning
spark.read.format("delta").load("/data/orders") \
    .filter("year = 2026 AND month = 2")  # reads only 1 partition
```
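A quick way to confirm that pruning is actually applied, assuming the /data/orders table above exists and `spark` is a live session, is to inspect the physical plan. This is a sketch; the exact plan text varies across Spark and Delta versions.

```python
# Read with the same partition filter and print the physical plan.
pruned = spark.read.format("delta").load("/data/orders") \
    .filter("year = 2026 AND month = 2")
pruned.explain()  # the scan node should show the year/month partition filters,
                  # confirming that non-matching partitions are never read
```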
Best Practices
- 1 GB+ per partition — partitions that are too small are counterproductive (see the size check sketched after this list)
- Max ~10k partitions — too many partitions slow down metadata and file listing
- Partition by filters — choose the columns that appear in the most common WHERE clauses
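As a rough way to apply the first rule, the sketch below sums file sizes per top-level partition directory. The root path is the hypothetical one from the earlier sketch, and the 1 GB threshold is the guideline from the list, not a hard limit.

```python
from pathlib import Path

MIN_BYTES = 1 * 1024**3  # ~1 GB guideline from the list above

def partition_sizes(root: str) -> dict:
    """Total bytes per first-level partition directory (e.g. month=1)."""
    sizes = {}
    for part_dir in Path(root).iterdir():
        if part_dir.is_dir():
            sizes[part_dir.name] = sum(
                f.stat().st_size for f in part_dir.rglob("*") if f.is_file()
            )
    return sizes

for name, size in sorted(partition_sizes("/tmp/orders_demo").items()):
    verdict = "OK" if size >= MIN_BYTES else "too small; consider coarser granularity"
    print(f"{name}: {size / 1024**2:.1f} MiB ({verdict})")
```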
Summary
Proper partitioning is key to performance: choose partition columns based on the most common query filters, and keep individual partitions large enough (roughly 1 GB or more).