amogh-jahagirdar commented on code in PR #12657: URL: https://github.com/apache/iceberg/pull/12657#discussion_r2017044243
########## spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/functions/BucketFunction.java: ########## @@ -214,12 +214,11 @@ public static Integer invoke(int numBuckets, UTF8String value) { return null; } - // TODO - We can probably hash the bytes directly given they're already UTF-8 input. - return apply(numBuckets, hash(value.toString())); + return apply(numBuckets, hash(value.getBytes())); } // Visible for testing - public static int hash(String value) { + public static int hash(byte[] value) { return BucketUtil.hash(value); } Review Comment: Even though it's a bit duplicative, I think I'd prefer to keep the existing `hash(String value)` function mainly for validation in the test that hash(String) == hash(utf8Bytes). My rationale is that since hash(String) has been around a while and we want real confidence that the new values being produced are the same, and imo the way to get that confidence is to directly compare the new hash output against the existing hash output. Compatibility is a secondary benefit here just because we don't really care too much about compatibility in this module imo. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org