Re: [PR] Spark, API: Enhance hashing efficiency by operating on raw UTF-8 bytes [iceberg]

via GitHub Thu, 27 Mar 2025 12:06:26 -0700


xiaoxuandev commented on code in PR #12657:
URL: https://github.com/apache/iceberg/pull/12657#discussion_r2017433032



##########
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/functions/BucketFunction.java:
##########
@@ -214,12 +214,11 @@ public static Integer invoke(int numBuckets, UTF8String value) {
         return null;
       }
 
-      // TODO - We can probably hash the bytes directly given they're already UTF-8 input.
-      return apply(numBuckets, hash(value.toString()));
+      return apply(numBuckets, hash(value.getBytes()));
     }
 
     // Visible for testing
-    public static int hash(String value) {
+    public static int hash(byte[] value) {
       return BucketUtil.hash(value);
     }

Review Comment:
   Thanks, Amogh, for the review! That makes sense. Keeping `hash(String value)` 
for validation helps ensure correctness and guards against regressions. I will 
add a test case that checks the byte-based hash against it.
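
A minimal sketch of what such a test could check, assuming the goal is that hashing the raw UTF-8 bytes gives the same result as the existing String-based path. The class `HashEquivalenceSketch` and its methods `hashString`, `hashBytes`, and `murmur3` are hypothetical stand-ins, not the PR's code: the local Murmur3 (32-bit, seed 0) routine only stands in for `BucketUtil.hash`, and `String.getBytes(UTF_8)` stands in for `UTF8String.getBytes()`.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in class; it does not call Iceberg or Spark.
class HashEquivalenceSketch {

  // String path: encode to UTF-8 first, then hash (stand-in for hash(String value)).
  static int hashString(String value) {
    return murmur3(value.getBytes(StandardCharsets.UTF_8));
  }

  // Byte path: hash bytes that are already UTF-8 (stand-in for hash(byte[] value)).
  static int hashBytes(byte[] utf8) {
    return murmur3(utf8);
  }

  // Local 32-bit Murmur3 (x86 variant) with seed 0, used here as an assumed hash.
  static int murmur3(byte[] data) {
    final int c1 = 0xcc9e2d51;
    final int c2 = 0x1b873593;
    int h1 = 0;
    int len = data.length;
    int i = 0;
    // Body: consume 4-byte little-endian blocks.
    while (i + 4 <= len) {
      int k1 = (data[i] & 0xff)
          | ((data[i + 1] & 0xff) << 8)
          | ((data[i + 2] & 0xff) << 16)
          | ((data[i + 3] & 0xff) << 24);
      k1 *= c1;
      k1 = Integer.rotateLeft(k1, 15);
      k1 *= c2;
      h1 ^= k1;
      h1 = Integer.rotateLeft(h1, 13);
      h1 = h1 * 5 + 0xe6546b64;
      i += 4;
    }
    // Tail: mix in the remaining 1 to 3 bytes, if any.
    int k1 = 0;
    int tail = len & 3;
    if (tail == 3) {
      k1 ^= (data[i + 2] & 0xff) << 16;
    }
    if (tail >= 2) {
      k1 ^= (data[i + 1] & 0xff) << 8;
    }
    if (tail >= 1) {
      k1 ^= (data[i] & 0xff);
      k1 *= c1;
      k1 = Integer.rotateLeft(k1, 15);
      k1 *= c2;
      h1 ^= k1;
    }
    // Finalization: fold in the length and avalanche the bits.
    h1 ^= len;
    h1 ^= h1 >>> 16;
    h1 *= 0x85ebca6b;
    h1 ^= h1 >>> 13;
    h1 *= 0xc2b2ae35;
    h1 ^= h1 >>> 16;
    return h1;
  }

  public static void main(String[] args) {
    // ASCII, empty, accented, CJK, and a surrogate-pair (emoji) sample.
    String[] samples = {"iceberg", "", "h\u00e9llo", "\u65e5\u672c\u8a9e", "\uD83D\uDE00"};
    for (String s : samples) {
      byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
      if (hashString(s) != hashBytes(utf8)) {
        throw new AssertionError("hash mismatch for: " + s);
      }
    }
    System.out.println("String and byte paths agree on all samples");
  }
}
```

In a real test, the two stand-in methods would be replaced by the retained `hash(String value)` and the new byte-based overload, with the bytes taken from a `UTF8String`; the non-ASCII samples matter most, since they are where an encoding mismatch would show up.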



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

