bk-mz commented on issue #13957: URL: https://github.com/apache/iceberg/issues/13957#issuecomment-3252354120
@RussellSpitzer 👋 SPJ is too fragile to be triggered: it won't work with non-materialized data such as what we have with Kafka streaming, and it always requires the same partitioning on both the left and right sides. Of course, if SPJ didn't have those restrictions this would all make sense, but at the moment it does.

This issue is about an optimization. When a batch has to update many partitions at once, why does the engine need a huge shuffle before applying the deltas? Scoping each single-partition update to its own task and resolving the tasks independently and concurrently (partitions are isolated from each other) seems like a far more efficient approach than sorting the full dataset in memory. We were at least able to simulate this locally by splitting the batch ourselves, as in the sketch below. The downsides are many more commits per batch and a requirement that merges be idempotent. But why can't this be solved at the Iceberg level?
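For concreteness, here is a minimal Spark/Scala sketch of the local simulation described above: split the incoming batch by partition value, then run one MERGE scoped to a single partition per task, submitting them concurrently. This is our workaround, not an Iceberg API; all table and column names (`db.target`, `db.updates_batch`, `part`, `id`) are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

val spark: SparkSession = SparkSession.active

// The incoming micro-batch; table and column names are illustrative only.
val updates = spark.table("db.updates_batch")
updates.createOrReplaceTempView("updates")

// Distinct partition values touched by this batch (assumes `part` is a string column).
val parts = updates.select("part").distinct().collect().map(_.getString(0))

// One scoped MERGE per partition, submitted concurrently. Each MERGE reads and
// rewrites only a single partition, so the jobs are isolated from each other;
// the price is one commit per partition and idempotent merge semantics.
val merges = parts.toSeq.map { p =>
  Future {
    spark.sql(
      s"""MERGE INTO db.target t
         |USING (SELECT * FROM updates WHERE part = '$p') s
         |ON t.id = s.id AND t.part = '$p'
         |WHEN MATCHED THEN UPDATE SET *
         |WHEN NOT MATCHED THEN INSERT *""".stripMargin)
  }
}
Await.result(Future.sequence(merges), Duration.Inf)
```

Note that each MERGE produces its own snapshot, so the concurrent commits contend on the table metadata; Iceberg's optimistic retries generally resolve this when the merges touch disjoint partitions, but this is exactly the commit amplification flagged as a downside above.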