
How to Choose the Best Open Table Format for AI/ML Workloads


If you’re working with AI/ML workloads (like me) and trying to decide which data format to choose, this post is for you. Whether you’re a student, analyst, or engineer, understanding the differences between Apache Iceberg, Delta Lake, and Apache Hudi can save you a ton of headaches when it comes to performance, scalability, and real-time updates. By the end of this guide, you’ll have a solid grasp of the core features and be able to pick the best open table format for AI/ML workloads. Let’s dive in!

Why Do We Need an Open Table Format for AI/ML Workloads?

Traditional data lakes have some limitations, and to address these challenges, three major open table formats have been designed. I’ve added an architecture diagram for each format later in the post:

  1. Apache Iceberg
  2. Delta Lake
  3. Apache Hudi

Key Benefits of These Formats

These formats tackle some of the most significant issues with traditional data lakes:

  • Lack of ACID Transactions: Iceberg, Delta Lake, and Hudi solve this by ensuring reliability with safe concurrent reads and writes.
  • No Historical Data Tracking: Iceberg, Delta Lake, and Hudi enable this by reproducing past data states for debugging, ML training, or auditing.
  • Data & Metadata Scalability: All three formats keep data and metadata scalable, for example through file compaction to control the small-file problem.

Comparison Based on AI/ML Use Cases

Let’s take a look at each format’s approach in three key areas:

  • Feature Stores: How well each format supports storing, updating, and serving feature data for ML models.
  • Model Training: How well each format supports the data requirements for training ML models.
  • Scalable ML Pipelines: How well each format handles large-scale data processing.

Also read: What Is a Data Lake? Step-by-Step Guide

What Is Apache Iceberg?

(Apache Iceberg Architecture/High-Level Diagram)
Source: Author

The Apache Iceberg open table format has become an industry standard for managing data lakes and resolving the problems of the traditional data lake. It provides high-performance analytics on large datasets.

In terms of feature stores, Apache Iceberg supports ACID transactions using snapshot isolation to ensure reliable concurrent writes. Moreover, Iceberg allows schema changes without breaking existing queries, meaning you don’t have to rewrite datasets to make changes the way you did in traditional data lakes. Iceberg supports time travel using snapshots, allowing users to query older versions of a table. Iceberg addresses poor query performance with hidden partitioning and metadata indexing, which speed up queries and improve data organization and access efficiency.
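To make this concrete, here is a minimal PySpark sketch of those two ideas, assuming a Spark session already configured with an Iceberg catalog named "demo" and a hypothetical feature table demo.ml.user_features (the snapshot id is illustrative):

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg runtime jar and a catalog named "demo" are already configured.
spark = SparkSession.builder.appName("iceberg-feature-store").getOrCreate()

# Schema evolution: add a new feature column without rewriting existing data files.
spark.sql("ALTER TABLE demo.ml.user_features ADD COLUMN avg_session_secs DOUBLE")

# Time travel: inspect snapshots, then read the table as of an older snapshot.
spark.sql("SELECT snapshot_id, committed_at FROM demo.ml.user_features.snapshots").show()
old_features = (spark.read
                .option("snapshot-id", 123456789012345678)   # illustrative snapshot id
                .format("iceberg")
                .load("demo.ml.user_features"))
```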

In terms of model training, Iceberg supports ML data requirements by optimizing fast data retrieval for quicker model training, supporting time travel, and using snapshot isolation to ensure that data stays consistent and doesn’t get corrupted by concurrent updates. It efficiently filters data through hidden partitioning to improve query speed and supports predicate pushdown, ensuring ML frameworks like Spark, PyTorch, and TensorFlow load data efficiently. Iceberg allows schema evolution without breaking queries, supporting evolving ML needs.
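As a rough sketch of what that looks like in PySpark (reusing the Spark session above; the table demo.ml.training_events, its columns, and the snapshot id are hypothetical), a training job can pin a snapshot and push filters down to Iceberg before handing data to an ML framework:

```python
# Pin an exact snapshot so the training run is reproducible; filters on partitioned
# columns are pushed down so only the relevant data files are scanned.
train_df = (spark.read
            .option("snapshot-id", 987654321098765432)       # illustrative snapshot id
            .format("iceberg")
            .load("demo.ml.training_events")
            .where("event_date >= '2025-01-01' AND label IS NOT NULL")
            .select("features", "label"))

train_pdf = train_df.toPandas()   # hand off to PyTorch, TensorFlow, scikit-learn, etc.
```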

In terms of scalable ML pipelines, Iceberg’s compatibility with numerous processing engines, such as Apache Spark, Flink, Trino, and Presto, provides flexibility in building scalable ML pipelines. It supports faster pipeline execution, shortening ML model training cycles. Iceberg supports incremental data processing, so ML pipelines don’t have to reprocess the entire dataset; they only need to process changed or new data, which results in cost savings in a cloud environment. Iceberg’s ACID transactions ensure safe concurrent writes and reliable ML data pipelines, avoiding data inconsistencies in distributed environments.
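A small sketch of that incremental pattern, using the same hypothetical table and two illustrative snapshot ids (Iceberg’s incremental read covers data appended between the two snapshots):

```python
# Read only the rows added between two snapshots instead of the full table.
changes = (spark.read.format("iceberg")
           .option("start-snapshot-id", 111111111111111111)  # last processed snapshot
           .option("end-snapshot-id",   222222222222222222)  # latest snapshot
           .load("demo.ml.training_events"))

# Append just the new feature rows to a downstream table (name is illustrative).
changes.writeTo("demo.ml.training_features").append()
```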

What Is Delta Lake?

(Delta Lake Architecture/High-Level Diagram)
Source: Author

Delta Lake, developed by Databricks, the creators of Apache Spark, is an open-source storage layer that integrates seamlessly with Spark for both reading and writing. It combines Apache Parquet data files with a rich metadata transaction log and has deep integrations with Spark.

In terms of feature stores, Delta Lake performs ACID transactions and handles concurrency to ensure that writes, updates, and deletes don’t result in corrupt data. It uses metadata layers to track transactions and enforce consistency. Additionally, Delta Lake prevents bad data from entering a table by enforcing table constraints while still allowing schema changes. However, some schema alterations, such as dropping columns, require careful handling. Users can query earlier versions of the data thanks to the time travel functionality enabled by the transaction log. Delta Lake optimizes query performance by using its metadata and transaction logs. Importantly, Delta Lake enables real-time changes with support for streaming writes. In addition, it addresses cost and storage concerns through file compaction.
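Here is a minimal sketch of those behaviors with the delta-spark package, assuming a recent Delta Lake release and a hypothetical feature table ml.user_features (the DataFrame new_features stands in for freshly computed feature rows):

```python
# Schema enforcement: appends that don't match the table schema are rejected;
# mergeSchema explicitly opts in to additive schema changes.
(new_features.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("ml.user_features"))

# Time travel: read an earlier version of the feature table for debugging or audits.
features_v3 = spark.sql("SELECT * FROM ml.user_features VERSION AS OF 3")

# Compact the small files produced by streaming writes.
spark.sql("OPTIMIZE ml.user_features")
```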

For model training, Delta Lake maintains reliable and versioned training data with ACID transactions. ML teams use the time travel and rollback features to train on historical snapshots, which improves reproducibility and debugging. Z-ordering improves query performance and reduces I/O costs because it clusters related data together. In addition, Delta Lake improves read performance through partition pruning, metadata indexing, and Z-ordering. Finally, Delta Lake supports schema changes without affecting availability.
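A short sketch of those two ideas (the table name, columns, and version number are illustrative, and ZORDER BY requires a recent Delta Lake release):

```python
# Cluster data by the columns training jobs filter on most, reducing the files scanned.
spark.sql("OPTIMIZE ml.training_events ZORDER BY (user_id, event_date)")

# Train on a fixed historical version so the experiment can be reproduced later.
snapshot = spark.sql("SELECT * FROM ml.training_events VERSION AS OF 42")
train_pdf = snapshot.where("event_date >= '2025-01-01'").toPandas()
```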

For scalable ML pipelines, Delta Lake’s tight coupling with Apache Spark makes it easy to integrate into existing ML workflows. New data is continuously ingested because it supports real-time streaming with Spark Structured Streaming, which enables quicker decision-making. Finally, Delta Lake lets multiple ML teams work on the same dataset concurrently without corruption, thanks to ACID transactions.
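For example, here is a minimal Structured Streaming job that lands events from Kafka into a Delta table (the broker, topic, and paths are placeholders):

```python
# Continuously ingest raw events from Kafka into a Delta table.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "feature-events")
          .load())

query = (events.selectExpr("CAST(value AS STRING) AS raw_event")
         .writeStream
         .format("delta")
         .option("checkpointLocation", "/chk/feature-events")   # enables recovery after failure
         .outputMode("append")
         .start("/lake/ml/feature_events"))
```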

What Is Apache Hudi?

(Apache Hudi Architecture/High-Level Diagram)
Source: Author

Apache Hudi adds an open-source transactional storage layer to the data lake stack, supporting real-time analytics and incremental processing. Hudi lets data lakes support incremental processing, enabling slow batch processing to turn into near real-time analytics.

With regard to feature stores, Hudi enables ACID transactions, and it is possible to track events using the commit timeline and metadata layers. Thus, there is no chance of inconsistent data resulting from writes, updates, and deletes. Hudi allows some schema evolution, but certain schema changes, such as dropping columns, require care so as not to break existing queries. Hudi’s commit timeline also enables time travel and rollback functionality, which supports querying older versions and rolling back changes. In addition, Hudi’s query performance is improved through multiple indexing strategies, including Bloom filters and global and partition-level indexes. Hudi optimizes frequently updated tables using the Merge-on-Read (MoR) storage model. Hudi allows streaming writes but doesn’t offer fully continuous streaming like Delta Lake’s Spark Structured Streaming. Instead, Hudi works in micro-batch or incremental batch modes with integrations for Apache Kafka, Flink, and Spark Structured Streaming.
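A minimal sketch of a Hudi upsert into a Merge-on-Read table (the table name, record key, and paths are illustrative, and updated_features stands in for the latest feature rows):

```python
hudi_opts = {
    "hoodie.table.name": "user_features",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "user_id",      # record-level key
    "hoodie.datasource.write.precombine.field": "updated_at",  # newest row wins on conflict
}

(updated_features.write
    .format("hudi")
    .options(**hudi_opts)
    .mode("append")
    .save("/lake/ml/user_features"))
```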

Hudi is great for real-time machine learning applications like fraud detection or recommendation systems because it enables real-time updates during model training. It lowers compute costs because the system only has to load the changed data instead of reloading entire datasets. Merge-on-Read incremental queries are managed seamlessly. Flexible ingestion modes make Hudi work well for both batch and real-time ML training and allow it to support multiple ML pipelines concurrently.
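Roughly, an incremental pull looks like this (the begin instant time is illustrative; a real pipeline would persist the last processed commit between runs):

```python
# Read only the records that changed after the given commit instant.
incremental = (spark.read.format("hudi")
               .option("hoodie.datasource.query.type", "incremental")
               .option("hoodie.datasource.read.begin.instanttime", "20250301000000")
               .load("/lake/ml/user_features"))
```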

As for scalable ML pipelines, Hudi was designed for streaming-first workloads, so it is most appropriate for AI/ML use cases where data needs to be updated frequently, as in ad-bidding systems. It has built-in small-file management features to prevent performance bottlenecks. Hudi also lets datasets evolve efficiently by supporting record-level updates and deletes for both ML feature stores and training pipelines.
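As a sketch, a record-level delete only needs the keys to remove (the table name, key field, and path are illustrative):

```python
# Delete specific records without rewriting the whole dataset.
(keys_to_delete.write                       # DataFrame containing the user_id keys to remove
    .format("hudi")
    .option("hoodie.table.name", "user_features")
    .option("hoodie.datasource.write.operation", "delete")
    .option("hoodie.datasource.write.recordkey.field", "user_id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .mode("append")
    .save("/lake/ml/user_features"))
```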

| Issue / Feature                               | Iceberg    | Delta Lake | Hudi       |
|-----------------------------------------------|------------|------------|------------|
| ACID Transactions & Consistency               | Yes        | Yes        | Yes        |
| Schema Evolution                              | Yes        | Yes        | Yes        |
| Time Travel & Versioning                      | Yes        | Yes        | Yes        |
| Query Optimization (Partitioning & Indexing)  | Yes (best) | Yes        | Yes        |
| Real-Time Streaming Support                   | No         | Yes        | Yes (best) |
| Storage Optimization                          | Yes        | Yes        | Yes        |

Apache Iceberg vs. Delta Lake vs. Hudi: Which Open Table Format Should You Choose for AI/ML Workloads?

If you’ve made it this far, you’ve learned about some of the important similarities and differences between Apache Iceberg, Delta Lake, and Apache Hudi.

Now it’s time to decide which format makes the most sense for your use case! My recommendation depends on which scenario is most relevant:

  • Iceberg: Go for Iceberg if you need efficient, large-scale batch processing with advanced metadata management, especially if you work with historical data and require time travel.
  • Delta Lake: Best for real-time, streaming AI/ML workloads where ACID transactions and incremental data processing are crucial.
  • Hudi: Ideal if you need high-frequency updates in real-time streaming AI/ML workloads and prefer more fine-grained control over data.

Conclusion

If your primary concern is streaming data and real-time updates, then Delta Lake or Hudi may be your best choice of open table format for AI/ML workloads. However, if you need advanced data management, historical versioning, and batch processing optimization, Iceberg stands out. For use cases that require both streaming and batch processing with record-level data updates, Hudi is likely the best option.

Sumit Gupta is a data science leader and published author with expertise in analytics, data modeling, and visualization. He’s passionate about making data accessible and actionable, with experience at companies like Notion, Dropbox, and Snowflake. Sumit also shares insights on SQL, dbt, and BI tools to help data professionals level up their skills on his LinkedIn and Instagram pages.
