amogh-jahagirdar commented on code in PR #12657: URL: https://github.com/apache/iceberg/pull/12657#discussion_r2017048228
########## spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/sql/TestSparkBucketFunction.java: ########## @@ -74,7 +74,7 @@ public void testSpecValues() { .as("Spec example: hash(2017-11-16T22:31:08) = -2047944441") .isEqualTo(-2047944441); - assertThat(new BucketFunction.BucketString().hash("iceberg")) + assertThat(new BucketFunction.BucketString().hash("iceberg".getBytes(StandardCharsets.UTF_8))) .as("Spec example: hash(\"iceberg\") = 1210000089") Review Comment: See comment above about keeping hash(String), instead of changing the existing assertion I think we should leave it as is and add a new assertion that asserts that hash("iceberg.getBytes(StandardCharsets.UTF_8)").equals(hash("iceberg")) ########## spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/functions/BucketFunction.java: ########## @@ -214,12 +214,11 @@ public static Integer invoke(int numBuckets, UTF8String value) { return null; } - // TODO - We can probably hash the bytes directly given they're already UTF-8 input. - return apply(numBuckets, hash(value.toString())); + return apply(numBuckets, hash(value.getBytes())); } // Visible for testing - public static int hash(String value) { + public static int hash(byte[] value) { return BucketUtil.hash(value); } Review Comment: Even though it's a bit duplicative, I think I'd prefer to keep the existing `hash(String value)` function mainly for validation in the test that hash(String) == hash(utf8Bytes). My rationale is that since hash(String) has been around a while we want real confidence that the new values being produced are the same, and the way to get that confidence is to directly compare the new output against the existing output. Compatibility is a secondary benefit here just because we don't really care too much about compatibility in this module imo. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org