Choosing the correct data format dramatically impacts performance and cost: Parquet for analytics, Avro for streaming, ORC for the Hive ecosystem, and JSON for flexibility.
Formats for Data Engineering
Apache Parquet
- Columnar format — ideal for analytical queries that read a subset of columns
- Compression — Snappy and ZSTD codecs give excellent compression ratios
- Predicate pushdown — min/max statistics in file metadata let engines skip row groups
- Usage: data lakes, warehouses, batch processing (see the sketch below)
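A minimal PySpark sketch of these two properties: a columnar write with an explicit codec, and a filtered read that Spark can push down to Parquet's row-group statistics. The session setup, paths, and column names are illustrative assumptions, not from the original.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-demo").getOrCreate()

# Illustrative orders dataset; schema and values are assumptions for the demo.
df = spark.createDataFrame(
    [(1, "2024-01-15", 9.99), (2, "2023-11-02", 25.00)],
    ["order_id", "order_date", "amount"],
)

# Columnar write with an explicit compression codec (ZSTD).
df.write.option("compression", "zstd").parquet("/data/orders_zstd.parquet")

# The filter can be pushed down: row groups whose min/max statistics
# exclude the predicate are skipped before decompression.
orders = spark.read.parquet("/data/orders_zstd.parquet")
orders.filter(orders.order_date >= "2024-01-01").show()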
Apache Avro
- Row format — efficient for writing and reading complete records
- Schema evolution — native support for adding and removing fields
- Compact binary — the schema is stored once in the file header, so records carry no field names
- Usage: Kafka messaging, CDC, streaming (see the schema-evolution sketch below)
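A minimal sketch of schema evolution, here using the fastavro library (an assumed choice; any Avro implementation resolves schemas the same way): records written with an old schema decode cleanly under a newer reader schema, with the added field taking its declared default.

import io
import fastavro

# Writer schema (v1): what producers originally used.
schema_v1 = {
    "type": "record", "name": "Order",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "amount", "type": "double"},
    ],
}

# Reader schema (v2): adds a field with a default, so old records still decode.
schema_v2 = {
    "type": "record", "name": "Order",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string", "default": "USD"},
    ],
}

buf = io.BytesIO()
fastavro.writer(buf, schema_v1, [{"id": 1, "amount": 9.99}])
buf.seek(0)

# Old data, new schema: the missing field falls back to its default.
for record in fastavro.reader(buf, reader_schema=schema_v2):
    print(record)  # {'id': 1, 'amount': 9.99, 'currency': 'USD'}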
Comparison
# Parquet: analytical queries (reusing df from the sketch above)
df.write.parquet("/data/orders.parquet")

# Avro: streaming/messaging (requires the spark-avro package,
# e.g. spark-submit --packages org.apache.spark:spark-avro_2.12:3.5.0)
df.write.format("avro").save("/data/orders.avro")
Approximate sizes for the same 1M-row dataset:
- CSV: 500 MB
- JSON: 400 MB
- Avro: 100 MB
- Parquet: 50 MB (columnar compression)
When to Use Which Format
- Analytics/batch → Parquet
- Streaming/messaging → Avro
- Hive ecosystem → ORC
- API/logs → JSON (human-readable); each route is shown in the sketch below
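The same routing expressed as PySpark write calls, reusing the illustrative df from above; the paths are assumptions, and the Avro line again needs the spark-avro package.

df.write.parquet("/lake/orders")                # analytics/batch
df.write.format("avro").save("/stream/orders")  # streaming/messaging
df.write.orc("/warehouse/orders")               # Hive ecosystem
df.write.json("/logs/orders")                   # human-readable APIs/logs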
Summary
Parquet is the default for analytics, Avro for streaming. Choosing the right format saves both storage and compute.