Re: [PR] Spark, API: Enhance hashing efficiency by operating on raw UTF-8 bytes [iceberg]

via GitHub Thu, 27 Mar 2025 09:22:01 -0700


amogh-jahagirdar commented on code in PR #12657:
URL: https://github.com/apache/iceberg/pull/12657#discussion_r2017044243



##########
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/functions/BucketFunction.java:
##########
@@ -214,12 +214,11 @@ public static Integer invoke(int numBuckets, UTF8String 
value) {
         return null;
       }
 
-      // TODO - We can probably hash the bytes directly given they're already 
UTF-8 input.
-      return apply(numBuckets, hash(value.toString()));
+      return apply(numBuckets, hash(value.getBytes()));
     }
 
     // Visible for testing
-    public static int hash(String value) {
+    public static int hash(byte[] value) {
       return BucketUtil.hash(value);
     }

Review Comment:
   Even though it's a bit duplicative, I think I'd prefer to keep the existing 
`hash(String value)` function mainly for validation in the test that 
hash(String) == hash(utf8Bytes). My rationale is that since hash(String) has 
been around a while and we want real confidence that the new values being 
produced are the same, and imo the way to get that confidence is to directly 
compare the new hash output against the existing hash output. Compatibility is 
a secondary benefit here just because we don't really care too much about 
compatibility in this module imo. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Spark, API: Enhance hashing efficiency by operating on raw UTF-8 bytes [iceberg]

Reply via email to