bk-mz opened a new issue, #13957:
URL: https://github.com/apache/iceberg/issues/13957

   ### Feature Request / Improvement
   
   ## Problem Statement
   
   When executing MERGE operations that affect a large number of partitions, Iceberg currently processes the entire MERGE atomically, as a single logical operation.
   
   This means all affected partitions are included in one execution plan, 
requiring Spark to shuffle data across all partitions simultaneously.
   
   The planner cannot isolate partition-specific operations, even though 
partitions are physically independent storage units.
   
   This creates excessive shuffle overhead in Spark that scales poorly with the 
number of affected partitions, even when data volume is modest.
   
   We've observed that this shuffle dominates processing time, and reducing 
batch size doesn't help if the partition count remains high.
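
   For concreteness, here is a minimal sketch of the pattern (the catalog, table, and column names are hypothetical): the target table is partitioned by day, but a single incoming batch touches many days, so one MERGE statement shuffles data across every affected partition.

   ```python
   from pyspark.sql import SparkSession

   # Assumes an Iceberg-enabled Spark session; `warehouse.db.events` is a
   # hypothetical table partitioned by `event_day`, and `updates_batch` is a
   # temp view holding the incoming batch.
   spark = SparkSession.builder.getOrCreate()

   spark.sql("""
       MERGE INTO warehouse.db.events AS t
       USING updates_batch AS s
         ON t.event_id = s.event_id AND t.event_day = s.event_day
       WHEN MATCHED THEN UPDATE SET *
       WHEN NOT MATCHED THEN INSERT *
   """)
   ```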
   
   ## Proposed Enhancement
   
   Enhance Iceberg to natively support partition-aware parallelism for MERGE 
operations:
   
   1. Add capability to automatically split MERGE operations by partition 
boundaries.
   2. Leverage partition independence to reduce shuffle scope per operation.
   3. Process partition groups concurrently while maintaining proper commit 
semantics.
   
   ## Impact
   
   This enhancement would:
   
   - Improve performance for workloads with high partition counts.
   - Reduce resource consumption for MERGE operations.
   - Better handle late-arriving data affecting historical partitions.
   - Eliminate the need for application-level workarounds.
   
   ## Current Workaround
   
   The current workaround is an application-level solution (sketched in the code after this list) that:
   
   1. Splits the batch into partition-specific chunks.
   2. Processes chunks in parallel with controlled concurrency.
        - Each chunk contains a small subset of partitions to update
        - Each chunk runs a separate MERGE operation on its partition set
        - Multiple chunks execute concurrently up to a configured limit
   3. Reduces shuffle cost per operation by limiting partition scope.
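
   A minimal sketch of this workaround, assuming the same hypothetical `warehouse.db.events` table partitioned by `event_day` and an `updates_batch` temp view; the chunk size and concurrency limit are illustrative only:

   ```python
   from concurrent.futures import ThreadPoolExecutor

   from pyspark.sql import SparkSession

   spark = SparkSession.builder.getOrCreate()

   PARTITIONS_PER_CHUNK = 5    # partitions handled by each MERGE
   MAX_CONCURRENT_MERGES = 4   # controlled concurrency limit

   # 1. Split the batch into partition-specific chunks.
   days = sorted(r.event_day for r in
                 spark.table("updates_batch").select("event_day").distinct().collect())
   chunks = [days[i:i + PARTITIONS_PER_CHUNK]
             for i in range(0, len(days), PARTITIONS_PER_CHUNK)]

   def merge_chunk(chunk_days):
       day_list = ", ".join(f"DATE '{d}'" for d in chunk_days)
       # Restrict both source and target to this chunk's partitions so each
       # MERGE only shuffles data for a small partition subset.
       spark.sql(f"""
           MERGE INTO warehouse.db.events AS t
           USING (SELECT * FROM updates_batch WHERE event_day IN ({day_list})) AS s
             ON t.event_id = s.event_id
                AND t.event_day = s.event_day
                AND t.event_day IN ({day_list})
           WHEN MATCHED THEN UPDATE SET *
           WHEN NOT MATCHED THEN INSERT *
       """)

   # 2./3. Run chunks concurrently up to the configured limit. Each chunk is a
   # separate MERGE and therefore a separate Iceberg commit, so this relies on
   # Iceberg's optimistic concurrency and commit retries to resolve conflicts.
   with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_MERGES) as pool:
       list(pool.map(merge_chunk, chunks))
   ```
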
   ### Downsides of Workaround
   
   1. **Increased Complexity**: Requires custom application logic to handle 
chunking, concurrency control, and error handling.
   2. **Metadata Bloat**: Multiple small commits generate more snapshots, 
requiring more aggressive compaction and cleanup.
   3. **Consistency Challenges**: Application must ensure idempotent processing 
across job restarts when chunks are partially processed.
   4. **Parameter Tuning**: Requires careful configuration of commit retry parameters to handle the increased commit concurrency (see the example after this list).
   5. **Limited Optimization**: Application has less visibility into underlying 
storage than Iceberg itself would.
   6. **Resource Contention**: Concurrent operations can lead to resource 
contention at the metadata service level.
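
   On point 4, tuning typically means raising the table's commit retry properties (`commit.retry.num-retries`, `commit.retry.min-wait-ms`, `commit.retry.max-wait-ms`) so that concurrent chunk commits survive conflicts. The table name and values below are illustrative only:

   ```python
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.getOrCreate()

   # Hypothetical table; the right values depend on chunk count and commit latency.
   spark.sql("""
       ALTER TABLE warehouse.db.events SET TBLPROPERTIES (
           'commit.retry.num-retries' = '10',
           'commit.retry.min-wait-ms' = '250',
           'commit.retry.max-wait-ms' = '30000'
       )
   """)
   ```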
   
   --- 
   
   Iceberg version: 1.9.1
   Spark version: 3.5.5
   
   ### Query engine
   
   Spark
   
   ### Willingness to contribute
   
   - [ ] I can contribute this improvement/feature independently
   - [ ] I would be willing to contribute this improvement/feature with 
guidance from the Iceberg community
   - [x] I cannot contribute this improvement/feature at this time

