Re: [I] Support bucket transform on multiple data columns [iceberg]

via GitHub Wed, 13 Dec 2023 17:42:54 -0800


pdames commented on issue #5626:
URL: https://github.com/apache/iceberg/issues/5626#issuecomment-1854969584

+1 to adding support for this feature. On my side, the principal use case is
hashing a composite primary key when the cardinality of each individual key
column is unknown. In this case, we rely on (1) a uniform random hash
distribution across all composite primary key values and (2) a number of hash
buckets chosen so that all records for each bucket fit on a single node during
a distributed shuffle.
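
To make the two conditions concrete, here is a minimal Python sketch. It is
not Iceberg's bucket transform (the Iceberg spec hashes a single column with
32-bit Murmur3); SHA-1 from the standard library is swapped in only so the
example is self-contained. The names `composite_bucket` and `min_buckets`,
the sample columns, and the byte sizes are all hypothetical.

```python
import hashlib
from collections import Counter


def composite_bucket(key_values, num_buckets):
    """Map a composite key (a tuple of column values) to a bucket index.

    Illustrative stand-in only: uses SHA-1, not Iceberg's Murmur3 hashing.
    """
    digest = hashlib.sha1()
    for value in key_values:
        encoded = repr(value).encode("utf-8")
        # Length-prefix each field so field boundaries affect the hash.
        digest.update(len(encoded).to_bytes(4, "big"))
        digest.update(encoded)
    return int.from_bytes(digest.digest()[:8], "big") % num_buckets


def min_buckets(total_bytes, per_node_bytes):
    """Smallest bucket count whose average bucket size fits on one node.

    Assumes a near-uniform hash distribution; real sizing needs headroom.
    """
    return -(-total_bytes // per_node_bytes)  # ceiling division


# Hypothetical composite key (region, user_id); the per-column
# cardinalities never enter the bucket calculation.
records = [(region, user_id)
           for region in ("us", "eu", "ap")
           for user_id in range(1000)]
counts = Counter(composite_bucket(r, 16) for r in records)
print(sorted(counts.items()))
print(min_buckets(total_bytes=10 * 2**40, per_node_bytes=64 * 2**30))
```

Because the bucket depends only on the hash of the full key tuple, a skewed or
low-cardinality individual column does not by itself skew the buckets, which
is the property condition (1) relies on.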

This allows us to optimize operations like distributed joins, deduplication,
merge by primary key or any other composite unique key, etc. A bit more on the
merge-by-primary-key use-case and subsequent optimizations as implemented in
Ray can also be reviewed at
https://github.com/ray-project/deltacat/blob/main/deltacat/compute/compactor/TheFlashCompactorDesign.pdf.
We would like to be able to run our open source implementation of this design
in Ray across Iceberg tables with equivalent efficiency, and I believe that the
resolution of this issue would help here.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

