advancedxy commented on PR #8579: URL: https://github.com/apache/iceberg/pull/8579#issuecomment-1899645844
> First of all, we should evaluate other hash functions apart from Murmur3. Parquet, for instance, uses xxHash that is supposed to be much faster
> Second, Parquet avoids the modulo operator for performance reasons.

Both sound like great improvements to me. Apart from a faster hash, I'd like to add another option to explore while we are working on `bucketV2`: a user-defined hash function for the bucket transform. From time to time, users ask me whether it is possible to customize Iceberg's bucket partitioning strategy so that it has exactly the same distribution as downstream systems.

> If we merge a general change about multi-arg transforms, we can start working on changes to the expression API while figuring out the details about bucketV2.

I'm OK with merging the multi-arg transform first. However, I'm not sure how to provide examples for single-arg vs. multi-arg transforms, as there will be no `bucketV2` transform for now. I am referring to this part:

```markdown
|**`Partition Field`** [2]|`JSON object: {`<br /> `"source-id": <id int>,`<br /> `"field-id": <field id int>,`<br /> `"name": <name string>,`<br /> `"transform": <transform JSON>`<br />`}`|`{`<br /> `"source-id": 1,`<br /> `"field-id": 1000,`<br /> `"name": "id_bucket",`<br /> `"transform": "bucket[16]"`<br />`}`|
|**`Partition Field with multi-arg transform`** [3]|`JSON object: {`<br /> `"source-id": -1,`<br /> `"source-ids": <list of ids>,`<br /> `"field-id": <field id int>,`<br /> `"name": <name string>,`<br /> `"transform": <transform JSON>`<br />`}`|`{`<br /> `"source-id": -1,`<br /> `"source-ids": [1,2],`<br /> `"field-id": 1000,`<br /> `"name": "id_type_bucket",`<br /> `"transform": "bucketV2[16]"`<br />`}`|
```

@szehon-ho @aokolnychyi do you have any suggestions?

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
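For readers following the hash discussion, here is a minimal sketch of the two ideas mentioned in the comment: mapping a hash to a bucket without the modulo operator, and letting the caller supply the hash function. It is illustrative only and does not describe what `bucketV2` will do. The names `hash32`, `bucket_modulo`, `bucket_multiply_shift`, and `bucket` are hypothetical, and CRC32 is used purely as a stand-in 32-bit hash because it ships with the Python standard library; it is neither Murmur3 nor xxHash.

```python
import zlib
from typing import Callable


def hash32(value: bytes) -> int:
    """Stand-in 32-bit hash (CRC32). Assumption: NOT Murmur3 or xxHash."""
    return zlib.crc32(value) & 0xFFFFFFFF


def bucket_modulo(h: int, num_buckets: int) -> int:
    """Clear the sign bit of a 32-bit hash, then take the remainder."""
    return (h & 0x7FFFFFFF) % num_buckets


def bucket_multiply_shift(h: int, num_buckets: int) -> int:
    """Map an unsigned 32-bit hash to [0, num_buckets) without modulo.

    h / 2**32 lies in [0, 1); scaling by num_buckets and truncating
    gives a bucket index using one multiply and one shift.
    """
    return (h * num_buckets) >> 32


def bucket(
    value: bytes,
    num_buckets: int,
    hash_fn: Callable[[bytes], int] = hash32,  # hypothetical plug-in point
) -> int:
    """Bucket a value with a caller-supplied 32-bit hash function."""
    return bucket_multiply_shift(hash_fn(value) & 0xFFFFFFFF, num_buckets)


if __name__ == "__main__":
    for key in (b"alpha", b"beta", b"gamma"):
        h = hash32(key)
        print(key, bucket_modulo(h, 16), bucket_multiply_shift(h, 16))
```

The two mappings generally assign different buckets to the same hash, so switching between them changes the data layout; that is one reason such a change would need a new transform name rather than a change to the existing `bucket` transform.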
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org