pdames commented on issue #5626:
URL: https://github.com/apache/iceberg/issues/5626#issuecomment-1854969584

   +1 to adding support for this feature. On my side, the principal use case 
is hashing a composite primary key when the cardinality of each individual 
key column is unknown. In this case, we rely on (1) a uniformly random hash 
distribution across all composite primary key values and (2) a number of hash 
buckets configured so that all records for each bucket fit on a single node 
during a distributed shuffle.
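
   As a rough illustration of (1) and (2), here is a minimal Python sketch. It 
is not Iceberg's bucket transform (which the spec defines over a 32-bit Murmur3 
hash of a single value); SHA-1 from the standard library stands in as the hash, 
and the names `bucket_for_key`, `choose_num_buckets`, and the `headroom` 
parameter are hypothetical, chosen only for this example.

```python
import hashlib


def bucket_for_key(key_parts, num_buckets):
    """Map a composite key (a tuple of values) to a bucket in [0, num_buckets).

    Assumption: SHA-1 is a stand-in for whatever hash the engine actually uses.
    """
    # Length-prefix each encoded part so that ("ab", "c") and ("a", "bc")
    # do not collapse into the same byte string before hashing.
    encoded_parts = (repr(part).encode("utf-8") for part in key_parts)
    payload = b"".join(len(p).to_bytes(4, "big") + p for p in encoded_parts)
    digest = hashlib.sha1(payload).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_buckets


def choose_num_buckets(total_bytes, node_capacity_bytes, headroom=0.5):
    """Pick the smallest bucket count whose average bucket size fits within
    `headroom` * node capacity. Assumes a roughly uniform hash distribution.
    """
    usable = int(node_capacity_bytes * headroom)
    if usable <= 0:
        raise ValueError("node capacity too small for the requested headroom")
    # Ceiling division, with at least one bucket.
    return max(1, -(-total_bytes // usable))
```

   Because the whole key tuple is hashed at once, the per-column cardinality 
never enters the calculation; only the total data size and the per-node 
capacity are needed to size the buckets.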
   
   This allows us to optimize operations like distributed joins, deduplication, 
and merges by primary key or any other composite unique key. More detail on 
the merge-by-primary-key use case, and the subsequent optimizations as 
implemented in Ray, is available at 
https://github.com/ray-project/deltacat/blob/main/deltacat/compute/compactor/TheFlashCompactorDesign.pdf.
We would like to be able to run our open source implementation of this design 
in Ray across Iceberg tables with equivalent efficiency, and I believe that 
resolving this issue would help here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org