
Apache Hudi — Incremental Processing in Data Lake

24. 10. 2024 · Updated: 27. 03. 2026 · 1 min read · intermediate

Apache Hudi is designed for efficient data updates in a data lake. Ideal for CDC pipelines with frequent upserts.

Hudi — Upserts in a Data Lake

Uber developed Hudi to manage tables with billions of records and frequent updates.

Two Table Types

  • Copy-on-Write (CoW) — rewrites the whole data file on each update; optimized for reads
  • Merge-on-Read (MoR) — appends updates to delta log files merged at read time; optimized for writes
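The table type is selected at write time via a Hudi config option. A minimal sketch (the option keys are real Hudi config names; the helper function itself is illustrative, not part of the Hudi API):

```python
# Illustrative helper: build Hudi write options for a given table type.
# 'hoodie.datasource.write.table.type' accepts COPY_ON_WRITE or MERGE_ON_READ.
def hudi_write_opts(table_name: str, table_type: str) -> dict:
    assert table_type in ("COPY_ON_WRITE", "MERGE_ON_READ")
    return {
        'hoodie.table.name': table_name,
        'hoodie.datasource.write.table.type': table_type,
    }

cow_opts = hudi_write_opts('orders', 'COPY_ON_WRITE')  # read-optimized
mor_opts = hudi_write_opts('orders', 'MERGE_ON_READ')  # write-optimized
```

These dicts would be passed to `df.write.format("hudi").options(**...)` just like `hudi_opts` below.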
# Core write options: the record key identifies a row; the precombine
# field picks the newest version when a batch contains duplicates
hudi_opts = {
    'hoodie.table.name': 'orders',
    'hoodie.datasource.write.recordkey.field': 'order_id',
    'hoodie.datasource.write.precombine.field': 'updated_at',
    'hoodie.datasource.write.operation': 'upsert',
}

df.write.format("hudi").options(**hudi_opts)\
    .mode("append").save("/data/hudi/orders")

# Incremental query: read only records committed after a given instant
spark.read.format("hudi")\
    .option("hoodie.datasource.query.type", "incremental")\
    .option("hoodie.datasource.read.begin.instanttime", "20241024000000")\
    .load("/data/hudi/orders")

When to Use Hudi

Apache Hudi is an ideal choice for scenarios where you need to regularly update existing records in a data lake — for example, synchronizing with a production database via CDC (Change Data Capture). Unlike the traditional approach of rewriting entire Parquet files, Hudi enables efficient upserts and incremental reads of only changed data.
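Conceptually, an upsert merges incoming rows into the table by record key, keeping the version with the newest precombine value. A pure-Python sketch of that semantics (not Hudi's actual implementation; field names follow the `order_id`/`updated_at` example above):

```python
# Sketch of upsert semantics: an incoming row replaces an existing row
# with the same record key only if its precombine value is newer.
def upsert(table, incoming, key='order_id', precombine='updated_at'):
    merged = {row[key]: row for row in table}
    for row in incoming:
        current = merged.get(row[key])
        if current is None or row[precombine] >= current[precombine]:
            merged[row[key]] = row
    return sorted(merged.values(), key=lambda r: r[key])

table = [{'order_id': 1, 'status': 'new', 'updated_at': 1}]
incoming = [
    {'order_id': 1, 'status': 'paid', 'updated_at': 2},  # update
    {'order_id': 2, 'status': 'new', 'updated_at': 2},   # insert
]
result = upsert(table, incoming)
```

Hudi performs this merge at file-group level, so only the affected files are touched rather than the whole table.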

In practice, Hudi is often deployed alongside Debezium and Apache Kafka, where Debezium captures changes from PostgreSQL or MySQL and Hudi writes them to S3 or HDFS. With support for timeline and rollback mechanisms, Hudi also provides reliability at the level of ACID transactions. If your data pipeline processes millions of records daily and you need near-real-time access to current data, Hudi is the right choice.
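In such a pipeline, the Hudi writer typically flattens Debezium's change envelope into a plain row before the upsert. A simplified sketch (the event shape follows Debezium's standard `before`/`after`/`op`/`ts_ms` envelope; the `_deleted` flag and payload field names are illustrative):

```python
import json

# Flatten a Debezium change event into a row suitable for a Hudi upsert.
# 'op' is 'c' (create), 'u' (update), or 'd' (delete); delete events
# carry the row in 'before' and are mapped here to a soft-delete flag.
def to_hudi_row(event_json: str) -> dict:
    event = json.loads(event_json)
    op = event['op']
    row = dict(event['after'] if op != 'd' else event['before'])
    row['_deleted'] = (op == 'd')       # illustrative soft-delete flag
    row['updated_at'] = event['ts_ms']  # precombine field from event time
    return row

event = json.dumps({
    'op': 'u',
    'before': {'order_id': 1, 'status': 'new'},
    'after': {'order_id': 1, 'status': 'paid'},
    'ts_ms': 1729766400000,
})
row = to_hudi_row(event)
```

The resulting rows can be batched into a DataFrame and written with the `upsert` options shown earlier.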

Summary

Hudi is optimal for CDC and frequent updates. The CoW and MoR table types let you trade write cost against read cost.

Tags: apache hudi, data lake, upsert, cdc

CORE SYSTEMS team

We build core systems and AI agents that keep operations running. 15 years of experience with enterprise IT.