bk-mz commented on issue #13957: URL: https://github.com/apache/iceberg/issues/13957#issuecomment-3263549597
>That's what SPJ is, it's scoping the changes to single tasks per partition. Yes, I agree, SPJ is something that should help with the issue, but at the moment it's rather niche and fragile approach. First of all it requires both sides of the join to be materialized and deduplicated by key, which is already a challenge with large-scaled tables since big data materialization from streaming sources is typically a latency hit, which in turn is an SLO damage. But even if we materialize both sides, then slowly-changing-dimension merges such as `merge into ... when matched update ... when not matched insert` would _always_ be translated into full outer join. Full outer join must output every row from both sides, including those without a match. To know that a row is truly unmatched, the engine must see all rows for that join key from both inputs at the same time. SPJ keeps data in its existing storage partitions and avoids shuffles, so a large partition would need to be split across multiple tasks to handle skew. For full outer join that split is unsafe without reshuffling, because each task would only see a subset of keys and could not tell whether the matching row lives in another split. Spark therefore disables the skew splitting optimization for full outer join. The result is that with SPJ a single oversized partition pair must be processed as one task, which preserves correctness but leads to underperformance (data skews, spills, etc) when such partitions are large. >How exactly would it be triggered on the Iceberg level? I don't know. At least acknowledge that this (i.e. many partition update that forces a global shuffle) is a problem and that SPJ is not a silver bullet. Well, yes, we have a workaround that introduces splits-per-batch and therefore reduces shuffle footprint, but why this can't be solved on reconfiguring the iceberg merge plans in the engine? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
