singhpk234 opened a new pull request, #7128:
URL: https://github.com/apache/iceberg/pull/7128

   ### About the change
   
   Presently we use "%08x" to format the hash, which produces an 8-digit hex
   number padded with leading zeros. Because the hash is a non-negative int,
   the first digit is always in [0-7], so the distribution of the first
   character is skewed. Also, since we rely on a hex number, the character set
   is in any case limited to [0-9a-f].
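
To illustrate the skew, here is a minimal standalone sketch. It assumes the hash is a non-negative `int`, which is consistent with the [0-7] restriction reported further down; the class and method names are hypothetical and not part of this PR.

```java
import java.util.Set;
import java.util.TreeSet;

public class FirstHexDigitDemo {
  // Returns the first character of the zero-padded, 8-digit hex form of a value.
  static char firstHexChar(int hash) {
    return String.format("%08x", hash).charAt(0);
  }

  public static void main(String[] args) {
    // A non-negative int never has its sign bit set, so the leading hex
    // digit (the top 4 bits) can be at most 7.
    int[] samples = {0, 1, 0x0fffffff, 0x10000000, Integer.MAX_VALUE - 1, Integer.MAX_VALUE};
    Set<Character> seen = new TreeSet<>();
    for (int hash : samples) {
      seen.add(firstHexChar(hash));
    }
    System.out.println("first characters seen: " + seen);
  }
}
```

Only a negative value would produce a leading digit of 8 or above, so half of the hex alphabet is never used for the first character.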
   
   This change attempts to use a wider character set while keeping the
   distribution of the first character as uniform as possible.
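
As a rough sketch of the idea only, and not the `HashUtils` implementation from this PR, one way to get a wider character set is to encode the non-negative hash in base 62, least significant digit first, so that the first character is `value % 62`. The alphabet, its ordering, the fixed width of 6, and all names below are assumptions made for illustration.

```java
public class Base62Sketch {
  // Hypothetical alphabet: digits, then uppercase, then lowercase (62 characters).
  private static final String ALPHABET =
      "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
  // 62^6 exceeds Integer.MAX_VALUE, so 6 characters cover every non-negative int.
  private static final int WIDTH = 6;

  // Encodes a non-negative int as a fixed-width base-62 string,
  // least significant digit first.
  static String encode(int value) {
    if (value < 0) {
      throw new IllegalArgumentException("value must be non-negative: " + value);
    }
    StringBuilder sb = new StringBuilder(WIDTH);
    for (int i = 0; i < WIDTH; i++) {
      sb.append(ALPHABET.charAt(value % ALPHABET.length()));
      value /= ALPHABET.length();
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(encode(0));
    System.out.println(encode(61));
    System.out.println(encode(Integer.MAX_VALUE));
  }
}
```

Putting the least significant digit first means that, for hash values spread evenly over the int range, each of the 62 characters is about equally likely to lead; a most-significant-first encoding would favor the low end of the alphabet instead.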
   
   Sample code used to measure the distribution:
   ```
     @Test
     public void distributionOfFirstChar() {
       // Bucket transform over strings, used by the old "%08x" formatting.
       Function<Object, Integer> HASH_FUNC =
           Transforms.bucket(Integer.MAX_VALUE).bind(Types.StringType.get());
       Map<String, Integer> hm = Maps.newHashMap();
       for (int i = 0; i < 1000000; ++i) {
         String randomUUID = UUID.randomUUID().toString();
         // Before this change:
         // String hashFunc = String.format("%08x", HASH_FUNC.apply(randomUUID));
         String hashFunc = HashUtils.computeHash(randomUUID);
         String firstChar = hashFunc.substring(0, 1);
         hm.put(firstChar, (hm.getOrDefault(firstChar, 0) + 1));
       }
   
       // Print how many hashes start with each character.
       for (String key : hm.keySet()) {
         System.out.println(String.format("hm[%s] = %s", key, hm.get(key)));
       }
     }
   ```
   
   Distribution of the first character before this change (1M UUID strings);
   it is restricted to [0-7]:
   hm[0] = 125099
   hm[1] = 124953
   hm[2] = 125440
   hm[3] = 124705
   hm[4] = 124777
   hm[5] = 125103
   hm[6] = 124908
   hm[7] = 125015
   
   
   Distribution of the first character after this change (1M UUID strings):
   hm[0] = 15715
   hm[1] = 15524
   hm[2] = 15861
   hm[3] = 15680
   hm[4] = 15411
   hm[5] = 15638
   hm[6] = 19410
   hm[7] = 19298
   hm[8] = 19472
   hm[9] = 19399
   hm[A] = 15661
   hm[B] = 15633
   hm[C] = 15414
   hm[D] = 15675
   hm[E] = 15711
   hm[F] = 15569
   hm[G] = 15767
   hm[H] = 15643
   hm[I] = 15616
   hm[J] = 15508
   hm[K] = 15636
   hm[L] = 15726
   hm[M] = 15701
   hm[N] = 15658
   hm[O] = 15525
   hm[P] = 15646
   hm[Q] = 15686
   hm[R] = 15666
   hm[S] = 15675
   hm[T] = 15521
   hm[U] = 15569
   hm[V] = 15613
   hm[W] = 15398
   hm[X] = 15797
   hm[Y] = 15855
   hm[Z] = 15620
   hm[a] = 19318
   hm[b] = 19460
   hm[c] = 19579
   hm[d] = 19673
   hm[e] = 15790
   hm[f] = 15687
   hm[g] = 15622
   hm[h] = 15833
   hm[i] = 15693
   hm[j] = 15547
   hm[k] = 15725
   hm[l] = 15521
   hm[m] = 15911
   hm[n] = 15468
   hm[o] = 15579
   hm[p] = 15753
   hm[q] = 15594
   hm[r] = 15723
   hm[s] = 15628
   hm[t] = 15433
   hm[u] = 15645
   hm[v] = 15544
   hm[w] = 15761
   hm[x] = 15524
   hm[y] = 15565
   hm[z] = 15527
   
   
   More resources:
   1. https://aws.amazon.com/blogs/aws/amazon-s3-performance-tips-tricks-seattle-hiring-event/


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

