aokolnychyi commented on issue #6679:
URL: https://github.com/apache/iceberg/issues/6679#issuecomment-1410945521

   I would be careful with `range`, as it may cause performance regressions, 
especially for MERGE. The range distribution requires sampling, which leads to 
double scanning and re-evaluation of particular nodes in the plan. This would 
cause the same issue we have today, where the default performs poorly. 
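
To illustrate the double-scan concern, here is a toy model in plain Python. It is not Spark's implementation: `CountingSource`, `hash_distribute`, and `range_distribute` are hypothetical names, and the source simply counts how often the upstream rows are re-evaluated. It assumes range partitioning must sample keys to pick boundaries before routing rows, while hash partitioning can route each row on its own.

```python
import bisect
import random
from collections import Counter


class CountingSource:
    """Hypothetical stand-in for an expensive plan node (e.g. a MERGE join result)."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.scans = 0

    def scan(self):
        # Each call models re-evaluating the upstream plan from scratch.
        self.scans += 1
        return iter(self._rows)


def hash_distribute(source, num_partitions):
    # Single pass: the target partition depends only on the row's own key.
    parts = Counter()
    for key in source.scan():
        parts[hash(key) % num_partitions] += 1
    return parts


def range_distribute(source, num_partitions, sample_size=100, seed=0):
    # Pass 1: sample keys to estimate partition boundaries.
    rows = list(source.scan())
    rng = random.Random(seed)
    sample = sorted(rng.sample(rows, min(sample_size, len(rows))))
    step = max(1, len(sample) // num_partitions)
    bounds = [sample[min(i * step, len(sample) - 1)] for i in range(1, num_partitions)]
    # Pass 2: route every row by comparing its key against the boundaries.
    parts = Counter()
    for key in source.scan():
        parts[bisect.bisect_right(bounds, key)] += 1
    return parts


hash_src = CountingSource(range(10_000))
hash_distribute(hash_src, 8)
range_src = CountingSource(range(10_000))
range_distribute(range_src, 8)
print(hash_src.scans, range_src.scans)  # prints "1 2"
```

In this sketch the extra scan is cheap, but when the source is the output of a MERGE join, the second evaluation repeats that work.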
   
   The upcoming Spark 3.4 supports rebalancing partitions via AQE for 
hash distributions requested by v2 writes. That means we can request a hash 
distribution without worrying about too much data per task and OOMs. I'd 
rather switch to `hash` as the default and let users change the configuration 
if it fails. I don't know of a single use case where the range distribution 
performs well in MERGE at any reasonable scale.
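
For users who need to override the mode for a specific table, a minimal sketch of building the statement is below. It assumes the Iceberg table property `write.merge.distribution-mode` with the values `none`, `hash`, and `range`; the helper name `merge_distribution_sql` and the table `db.events` are hypothetical.

```python
VALID_MODES = ("none", "hash", "range")


def merge_distribution_sql(table, mode):
    """Build an ALTER TABLE statement that sets the MERGE distribution mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported distribution mode: {mode!r}")
    return (
        f"ALTER TABLE {table} "
        f"SET TBLPROPERTIES ('write.merge.distribution-mode' = '{mode}')"
    )


# `db.events` is a placeholder table name used only for illustration.
stmt = merge_distribution_sql("db.events", "hash")
print(stmt)
```

The resulting string could then be passed to `spark.sql(...)` in a session that has the Iceberg catalog configured.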


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

