Choosing the correct data format dramatically impacts performance and cost: Parquet for analytics, Avro for streaming, ORC for the Hive ecosystem, and JSON for flexibility.
Formats for Data Engineering
Apache Parquet
- Columnar format — ideal for analytical queries that read a subset of columns
- Compression — Snappy and ZSTD codecs give excellent compression ratios
- Predicate pushdown — min/max statistics in file metadata let engines skip row groups
- Usage: data lakes, warehouses, batch processing (see the sketch below)
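A minimal PySpark sketch of these two properties: a columnar write with an explicit codec, and a filtered read that Spark can push down to Parquet's row-group statistics. The session setup, paths, and column names are illustrative assumptions, not from the original.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-demo").getOrCreate()

# Illustrative orders dataset; schema and values are assumptions for the demo.
df = spark.createDataFrame(
    [(1, "2024-01-15", 9.99), (2, "2023-11-02", 25.00)],
    ["order_id", "order_date", "amount"],
)

# Columnar write with an explicit compression codec (ZSTD).
df.write.option("compression", "zstd").parquet("/data/orders_zstd.parquet")

# The filter can be pushed down: row groups whose min/max statistics
# exclude the predicate are skipped before decompression.
orders = spark.read.parquet("/data/orders_zstd.parquet")
orders.filter(orders.order_date >= "2024-01-01").show()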
Apache Avro
- Row format — efficient for writing and reading complete records
- Schema evolution — native support for adding and removing fields
- Compact binary — the schema is stored once in the file header, so records carry no field names
- Usage: Kafka messaging, CDC, streaming (see the schema-evolution sketch below)
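A minimal sketch of schema evolution, here using the fastavro library (an assumed choice; any Avro implementation resolves schemas the same way): records written with an old schema decode cleanly under a newer reader schema, with the added field taking its declared default.

import io
import fastavro

# Writer schema (v1): what producers originally used.
schema_v1 = {
    "type": "record", "name": "Order",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "amount", "type": "double"},
    ],
}

# Reader schema (v2): adds a field with a default, so old records still decode.
schema_v2 = {
    "type": "record", "name": "Order",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string", "default": "USD"},
    ],
}

buf = io.BytesIO()
fastavro.writer(buf, schema_v1, [{"id": 1, "amount": 9.99}])
buf.seek(0)

# Old data, new schema: the missing field falls back to its default.
for record in fastavro.reader(buf, reader_schema=schema_v2):
    print(record)  # {'id': 1, 'amount': 9.99, 'currency': 'USD'}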
Comparison
# Parquet: analytical queries (reusing df from the sketch above)
df.write.parquet("/data/orders.parquet")

# Avro: streaming/messaging (requires the spark-avro package,
# e.g. spark-submit --packages org.apache.spark:spark-avro_2.12:3.5.0)
df.write.format("avro").save("/data/orders.avro")
Approximate sizes for the same 1M-row dataset:
- CSV: 500 MB
- JSON: 400 MB
- Avro: 100 MB
- Parquet: 50 MB (columnar compression)
When to Use Which Format
- Analytics/batch → Parquet
- Streaming/messaging → Avro
- Hive ecosystem → ORC
- API/logs → JSON (human-readable); each route is shown in the sketch below
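The same routing expressed as PySpark write calls, reusing the illustrative df from above; the paths are assumptions, and the Avro line again needs the spark-avro package.

df.write.parquet("/lake/orders")                # analytics/batch
df.write.format("avro").save("/stream/orders")  # streaming/messaging
df.write.orc("/warehouse/orders")               # Hive ecosystem
df.write.json("/logs/orders")                   # human-readable APIs/logs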
Summary
Parquet is the default for analytics, Avro for streaming. Choosing the right format saves both storage and compute.