
Parquet, Avro, ORC — Serialization Formats for Data Engineering

24. 09. 2025 · 1 min read · intermediate

Choosing the right data format dramatically impacts performance and cost: Parquet for analytics, Avro for streaming, ORC for the Hive ecosystem, and JSON for flexibility.

Formats for Data Engineering

Apache Parquet

  • Columnar format — ideal for analytical queries
  • Compression — Snappy or ZSTD, with excellent compression ratios
  • Predicate pushdown — column statistics in the metadata let engines skip data (see the sketch after this list)
  • Usage: data lakes, warehouses, batch processing
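A minimal PySpark sketch of predicate pushdown in action; the SparkSession setup and the column names (order_id, amount) are illustrative assumptions, not taken from a real pipeline:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

# Parquet stores per-column min/max statistics in its metadata,
# so this filter can skip entire row groups without reading them
orders = spark.read.parquet("/data/orders.parquet")
orders.select("order_id", "amount").filter("amount > 100").show()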

Apache Avro

  • Row-based format — efficient for reading and writing complete records
  • Schema evolution — native support for adding fields with defaults (see the sketch after this list)
  • Compact binary — the schema is stored once in the file header, not per record
  • Usage: Kafka messaging, CDC, streaming
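A sketch of what native schema evolution means in practice, here using the fastavro library; the Order record and its fields are illustrative assumptions:

import io
import fastavro

# v1 schema: what the original writer used
schema_v1 = fastavro.parse_schema({
    "type": "record", "name": "Order",
    "fields": [
        {"name": "order_id", "type": "long"},
        {"name": "amount", "type": "double"},
    ],
})

# v2 schema: a new field with a default keeps old data readable
schema_v2 = fastavro.parse_schema({
    "type": "record", "name": "Order",
    "fields": [
        {"name": "order_id", "type": "long"},
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string", "default": "EUR"},
    ],
})

# Write with the old schema, read with the new one:
buf = io.BytesIO()
fastavro.writer(buf, schema_v1, [{"order_id": 1, "amount": 99.9}])
buf.seek(0)
for record in fastavro.reader(buf, reader_schema=schema_v2):
    print(record)  # {'order_id': 1, 'amount': 99.9, 'currency': 'EUR'}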

Comparison

# Parquet: analytical queries
df.write.parquet("/data/orders.parquet")

# Avro: streaming/messaging (requires the external spark-avro package)
df.write.format("avro").save("/data/orders.avro")

# Size (1M rows):
# CSV:     500 MB
# JSON:    400 MB
# Avro:    100 MB
# Parquet:  50 MB (columnar compression)

When to Use Which Format

  • Analytics/batch → Parquet
  • Streaming/messaging → Avro
  • Hive ecosystem → ORC (see the sketch after this list)
  • API/logs → JSON (human-readable)
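ORC has no example above, so here is a minimal sketch using Spark's built-in writer; the path and table name are illustrative:

# ORC: same columnar idea as Parquet, tuned for the Hive ecosystem
df.write.orc("/data/orders.orc")

# Or register it directly as a Hive table
# (requires a Hive-enabled SparkSession)
df.write.format("orc").saveAsTable("orders")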

Summary

Parquet is the default for analytics, Avro for streaming. Choosing the right format saves both storage and compute.

Tags: parquet, avro, orc, serialization

CORE SYSTEMS team

We build core systems and AI agents that keep operations running. 15 years of experience in enterprise IT.