aokolnychyi commented on issue #6679: URL: https://github.com/apache/iceberg/issues/6679#issuecomment-1410945521
I would be careful with `range`, as it may cause performance regressions, especially for MERGE. The range distribution requires sampling, which leads to scanning the data twice and re-evaluating particular nodes in the plan. That would leave us with the same problem we have today: a default that performs poorly. The upcoming Spark 3.4 supports rebalancing partitions via AQE for hash distributions requested by v2 writes. That means we can request a hash distribution without worrying about too much data per task and OOM errors. I'd rather switch to `hash` as the default and let users change the configuration if it fails. I don't know of a single use case where the range distribution performs well in MERGE at any reasonable scale.
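
As a rough sketch of what "let users configure" could look like in practice: assuming the per-table override is the Iceberg table property `write.merge.distribution-mode` (with `none`, `hash`, and `range` as accepted values), the helper below only builds the Spark SQL statement a user would run. The function name `merge_distribution_ddl` and the table `db.events` are hypothetical and not part of Iceberg or Spark.

```python
# Hypothetical helper: builds (does not execute) the Spark SQL that overrides
# the distribution mode Iceberg requests for MERGE writes on a single table.
# Assumption: the property name is `write.merge.distribution-mode`.
VALID_MODES = ("none", "hash", "range")


def merge_distribution_ddl(table: str, mode: str) -> str:
    """Return an ALTER TABLE statement that sets the MERGE distribution mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown distribution mode: {mode!r}")
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES "
        f"('write.merge.distribution-mode' = '{mode}')"
    )


# Example: opt one table out of a `hash` default if it turns out to fail there.
print(merge_distribution_ddl("db.events", "none"))
```

In a Spark session with an Iceberg catalog configured, the returned string could be passed to `spark.sql(...)`. Keeping the override per table means a `hash` default can stay in place while individual tables that misbehave are tuned separately.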