aokolnychyi commented on PR #8579: URL: https://github.com/apache/iceberg/pull/8579#issuecomment-1899444592
@rdblue recently pointed me to the Bloom filter [spec](https://github.com/apache/parquet-format/blob/master/BloomFilter.md) in Parquet. I think it contains a few interesting ideas that may be applicable to us. First, we should evaluate hash functions other than Murmur3. Parquet, for instance, uses xxHash, which is supposed to be much faster. Second, Parquet avoids the modulo operator for performance reasons when mapping a hash to a block index.

Given all this, I suggest we make this PR about multi-arg transforms in general (how they are stored, how they are serialized, what happens during schema evolution, compatibility, etc.) and submit another one with `bucketV2` that will not only support multiple input elements but also be faster. Once we merge a general change about multi-arg transforms, we can start working on changes to the expression API.

@advancedxy @szehon-ho, how does this sound?
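To make the modulo point concrete, here is a minimal sketch comparing the two ways of mapping a 32-bit hash to a bucket. The first method follows the current bucket transform, `(hash & Integer.MAX_VALUE) % N`. The second follows the multiply-shift mapping the Parquet spec uses to select a Bloom filter block. The class and method names are hypothetical and are not a proposed `bucketV2` API. The hash is taken as an input, so no particular hash function is assumed.

```java
// Hypothetical illustration only; not Iceberg or Parquet code.
final class BucketSketch {

    private BucketSketch() {}

    // Current style: clear the sign bit, then take the remainder.
    static int bucketModulo(int hash, int numBuckets) {
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    // Multiply-shift style: treat the hash as an unsigned 32-bit value,
    // multiply by the bucket count in 64-bit arithmetic, and keep the
    // upper 32 bits. The result is always in [0, numBuckets).
    static int bucketMultiplyShift(int hash, int numBuckets) {
        long unsignedHash = hash & 0xFFFFFFFFL;
        return (int) ((unsignedHash * numBuckets) >>> 32);
    }
}
```

The two mappings assign different buckets to the same hash, so switching would change partition values for existing data. That is one reason a separate transform, rather than a change to `bucket`, seems necessary.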