bk-mz commented on issue #13957: URL: https://github.com/apache/iceberg/issues/13957#issuecomment-3252354120
@RussellSpitzer 👋 SPJ is too fragile to be triggered: it won't work with non-materialized data such as what we have with Kafka streaming, and it always requires the same partitioning on both the left and right sides. Of course, if SPJ didn't have those restrictions this would all make sense, but at the moment it does.

This issue is about an optimization. When a batch has to update many partitions at once, why does the engine need a huge shuffle before applying the deltas? Scoping each single-partition update to its own task and resolving the tasks independently and concurrently (partitions are isolated from each other) seems like a far more efficient approach than sorting the full dataset in memory. We were at least able to simulate this locally by splitting the batch ourselves, as in the sketch below. The downsides are many more commits per batch and a requirement that merges be idempotent. But why can't this be solved at the Iceberg level?
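For concreteness, here is a minimal Spark/Scala sketch of the local simulation described above: split the incoming batch by partition value, then run one MERGE scoped to a single partition per task, submitting them concurrently. This is our workaround, not an Iceberg API; all table and column names (`db.target`, `db.updates_batch`, `part`, `id`) are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

val spark: SparkSession = SparkSession.active

// The incoming micro-batch; table and column names are illustrative only.
val updates = spark.table("db.updates_batch")
updates.createOrReplaceTempView("updates")

// Distinct partition values touched by this batch (assumes `part` is a string column).
val parts = updates.select("part").distinct().collect().map(_.getString(0))

// One scoped MERGE per partition, submitted concurrently. Each MERGE reads and
// rewrites only a single partition, so the jobs are isolated from each other;
// the price is one commit per partition and idempotent merge semantics.
val merges = parts.toSeq.map { p =>
  Future {
    spark.sql(
      s"""MERGE INTO db.target t
         |USING (SELECT * FROM updates WHERE part = '$p') s
         |ON t.id = s.id AND t.part = '$p'
         |WHEN MATCHED THEN UPDATE SET *
         |WHEN NOT MATCHED THEN INSERT *""".stripMargin)
  }
}
Await.result(Future.sequence(merges), Duration.Inf)
```

Note that each MERGE produces its own snapshot, so the concurrent commits contend on the table metadata; Iceberg's optimistic retries generally resolve this when the merges touch disjoint partitions, but this is exactly the commit amplification flagged as a downside above.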