Re: [I] Optimize Multi-Partition MERGE Operations with Partition-Level Parallelism [iceberg]

via GitHub Sun, 07 Sep 2025 00:13:28 -0700


bk-mz commented on issue #13957:
URL: https://github.com/apache/iceberg/issues/13957#issuecomment-3263549597


   >That's what SPJ is, it's scoping the changes to single tasks per partition.
   
   Yes, I agree, SPJ is something that should help with the issue, but at the 
moment it's rather niche and fragile approach.
   
   First of all it requires both sides of the join to be materialized and 
deduplicated by key, which is already a challenge with large-scaled tables 
since big data materialization from streaming sources is typically a latency 
hit, which in turn is an SLO damage.
   
   But even if we materialize both sides, then slowly-changing-dimension merges 
such as `merge into ... when matched update ... when not matched insert` would 
_always_ be translated into full outer join.
   
   Full outer join must output every row from both sides, including those 
without a match. To know that a row is truly unmatched, the engine must see all 
rows for that join key from both inputs at the same time. SPJ keeps data in its 
existing storage partitions and avoids shuffles, so a large partition would 
need to be split across multiple tasks to handle skew. For full outer join that 
split is unsafe without reshuffling, because each task would only see a subset 
of keys and could not tell whether the matching row lives in another split. 
   
   Spark therefore disables the skew splitting optimization for full outer 
join. The result is that with SPJ a single oversized partition pair must be 
processed as one task, which preserves correctness but leads to 
underperformance (data skews, spills, etc) when such partitions are large.
   
   >How exactly would it be triggered on the Iceberg level? 
   I don't know. At least acknowledge that this (i.e. many partition update 
that forces a global shuffle) is a problem and that SPJ is not a silver bullet. 
   
   Well, yes, we have a workaround that introduces splits-per-batch and 
therefore reduces shuffle footprint, but why this can't be solved on 
reconfiguring the iceberg merge plans in the engine? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] Optimize Multi-Partition MERGE Operations with Partition-Level Parallelism [iceberg]

Reply via email to