advancedxy commented on PR #8579:
URL: https://github.com/apache/iceberg/pull/8579#issuecomment-1899645844

   > First of all, we should evaluate other hash functions apart from Murmur3. 
Parquet, for instance, uses xxHash that is supposed to be much faster
   > Second, Parquet avoids the modulo operator for performance reasons.
   
   Both sound like great improvements to me. Apart from a faster hash, I'd like to add 
another option to explore while we are working on `bucketV2`: a user-defined hash 
function for the bucket transform. From time to time, users ask me whether it is 
possible to customize Iceberg's bucket partitioning strategy so that it produces 
exactly the same distribution as their downstream systems.
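
To make the two ideas concrete, here is a minimal sketch (not Iceberg code) of a bucket function with a pluggable hash and two ways of reducing a 32-bit hash to a bucket index. `bucket_modulo` follows the current spec's `(hash & Integer.MAX_VALUE) % N`; `bucket_multiply_shift` is one common modulo-free reduction and is only an assumption about what `bucketV2` might use. `make_bucket` and `crc_bucket` are hypothetical names, and `zlib.crc32` is just a stand-in for a user-supplied hash, not Murmur3 or xxHash.

```python
import zlib
from typing import Callable

# A hash function maps raw bytes to a 32-bit integer.
HashFn = Callable[[bytes], int]
ReduceFn = Callable[[int, int], int]


def bucket_modulo(hash32: int, num_buckets: int) -> int:
    """Current bucket rule: clear the sign bit, then take the remainder."""
    return (hash32 & 0x7FFFFFFF) % num_buckets


def bucket_multiply_shift(hash32: int, num_buckets: int) -> int:
    """Modulo-free reduction: scale an unsigned 32-bit hash into [0, N)."""
    return ((hash32 & 0xFFFFFFFF) * num_buckets) >> 32


def make_bucket(hash_fn: HashFn, reduce_fn: ReduceFn = bucket_modulo):
    """Build a bucket function from a user-supplied hash (hypothetical API)."""
    def bucket(value: bytes, num_buckets: int) -> int:
        return reduce_fn(hash_fn(value), num_buckets)
    return bucket


# Stand-in hash only; a real setup would plug in the downstream system's hash.
crc_bucket = make_bucket(zlib.crc32, bucket_multiply_shift)
```

With a design like this, matching a downstream system's distribution would mean supplying both its hash function and its reduction step, since the two reductions above generally assign the same hash to different buckets.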
   
   
   >  If we merge a general change about multi-arg transforms, we can start 
working on changes to the expression API while figuring out the details about 
bucketV2.
   
   I'm OK with merging the multi-arg transform change first. However, I'm not sure 
how to provide examples of single-arg vs. multi-arg transforms, as there 
will be no `bucketV2` transform for now. I am referring to this part:
   
   ```markdown
   |**`Partition Field`** [2]|`JSON object: {`<br />&nbsp;&nbsp;`"source-id": 
<id int>,`<br />&nbsp;&nbsp;`"field-id": <field id int>,`<br 
/>&nbsp;&nbsp;`"name": <name string>,`<br />&nbsp;&nbsp;`"transform": 
<transform JSON>`<br />`}`|`{`<br />&nbsp;&nbsp;`"source-id": 1,`<br 
/>&nbsp;&nbsp;`"field-id": 1000,`<br />&nbsp;&nbsp;`"name": "id_bucket",`<br 
/>&nbsp;&nbsp;`"transform": "bucket[16]"`<br />`}`|
   |**`Partition Field with multi-arg transform`** [3]|`JSON object: {`<br 
/>&nbsp;&nbsp;`"source-id": -1,`<br />&nbsp;&nbsp;`"source-ids": <list of 
ids>,`<br />&nbsp;&nbsp;`"field-id": <field id int>,`<br />&nbsp;&nbsp;`"name": 
<name string>,`<br />&nbsp;&nbsp;`"transform": <transform JSON>`<br 
/>`}`|`{`<br />&nbsp;&nbsp;`"source-id": -1,`<br />&nbsp;&nbsp;`"source-ids": 
[1,2],`<br />&nbsp;&nbsp;`"field-id": 1000,`<br />&nbsp;&nbsp;`"name": 
"id_type_bucket",`<br />&nbsp;&nbsp;`"transform": "bucketV2[16]"`<br />`}`|
   ```
   @szehon-ho  @aokolnychyi do you have any suggestions?
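
For illustration, a reader of the two field shapes in the table above might resolve source columns as sketched below. This assumes the layout proposed in this PR (`"source-id": -1` plus a `"source-ids"` list for multi-arg transforms); `source_ids` is a hypothetical helper, not an existing Iceberg API.

```python
from typing import Any, Dict, List


def source_ids(field: Dict[str, Any]) -> List[int]:
    """Return the source column ids for either partition field shape."""
    ids = field.get("source-ids")
    if ids:
        # Multi-arg form: the list is authoritative, "source-id" is -1.
        return list(ids)
    # Single-arg form: only "source-id" is present.
    return [field["source-id"]]


single = {"source-id": 1, "field-id": 1000,
          "name": "id_bucket", "transform": "bucket[16]"}
multi = {"source-id": -1, "source-ids": [1, 2], "field-id": 1000,
         "name": "id_type_bucket", "transform": "bucketV2[16]"}
```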


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

