pdames commented on issue #5626: URL: https://github.com/apache/iceberg/issues/5626#issuecomment-1854969584
+1 to adding support for this feature. On my side, the main use case where this helps is hashing a composite primary key when the cardinality of each individual key column is unknown. In this case, we rely on (1) a uniform random hash distribution over all composite primary key values and (2) configuring the number of hash buckets such that all records for each bucket fit on a single node during a distributed shuffle. This lets us optimize operations like distributed joins, deduplication, and merges by primary key or any other composite unique key. More detail on the merge-by-primary-key use case and the subsequent optimizations, as implemented in Ray, is available at https://github.com/ray-project/deltacat/blob/main/deltacat/compute/compactor/TheFlashCompactorDesign.pdf. We would like to run our open source implementation of this design in Ray across Iceberg tables with equivalent efficiency, and I believe that resolving this issue would help here.
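To illustrate the bucketing scheme described in this comment, here is a minimal Python sketch that hashes a composite key into a fixed number of buckets. It is only an assumption-laden illustration: the function names `bucket_for_key` and `group_by_bucket` are hypothetical, and SHA-256 from the standard library stands in for whatever hash a real engine would use (it is not Iceberg's bucket transform and not the deltacat implementation).

```python
import hashlib


def bucket_for_key(key_columns, num_buckets):
    """Map a composite key (a tuple of column values) to a bucket index.

    Hypothetical helper: SHA-256 is used only because it is in the standard
    library and gives a near-uniform distribution over arbitrary inputs.
    """
    digest = hashlib.sha256()
    for value in key_columns:
        encoded = repr(value).encode("utf-8")
        # Length-prefix each column so ("ab", "c") and ("a", "bc")
        # feed different byte streams into the hash.
        digest.update(len(encoded).to_bytes(8, "big"))
        digest.update(encoded)
    # Reduce the first 8 bytes of the digest to a bucket index.
    return int.from_bytes(digest.digest()[:8], "big") % num_buckets


def group_by_bucket(records, key_fields, num_buckets):
    """Group records (dicts) by the bucket of their composite key."""
    buckets = {}
    for record in records:
        key = tuple(record[field] for field in key_fields)
        buckets.setdefault(bucket_for_key(key, num_buckets), []).append(record)
    return buckets


records = [
    {"tenant": "a", "order_id": 1, "amount": 10},
    {"tenant": "a", "order_id": 1, "amount": 12},  # same composite key as above
    {"tenant": "b", "order_id": 1, "amount": 7},
    {"tenant": "a", "order_id": 2, "amount": 3},
]
buckets = group_by_bucket(records, ["tenant", "order_id"], num_buckets=4)
```

Because every record with the same composite key lands in the same bucket, each bucket can be deduplicated or merged independently on one node, regardless of how many distinct values any single key column has.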