tyronecai opened a new pull request, #15779: URL: https://github.com/apache/lucene/pull/15779
### Description

Rehashing performance is improved by having `rehash` compute the hashcodes of all terms sequentially into a temporary int array. This improves memory locality, which speeds up the `pool.hash` computation, which in turn speeds up `rehash` and ultimately significantly improves `add` performance. The trade-off is that the temporary int array consumes additional memory (4 bytes * count), potentially an extra 40 MB for 10 million terms, which should be acceptable.

I also tested keeping an `int[] hashcodes` field in the `BytesRefHash` class, maintaining it during `add` and reusing it for the results of `findHash`. For large datasets this yields a further slight improvement (5-6%), but the memory overhead and maintenance cost seem uneconomical.

Thanks to OpenAI's Codex, I was able to validate my implementation very quickly, which is fantastic.
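To illustrate the idea, here is a minimal standalone sketch of the two-pass rehash, not the actual patch: `doHash` is a simple polynomial stand-in for Lucene's murmurhash, a `byte[][]` stands in for the byte pool, and plain linear probing stands in for `BytesRefHash`'s real probe sequence. Pass 1 reads the term bytes sequentially (the locality win); pass 2 does the scattered writes into the new table.

```java
import java.util.Arrays;

public class PrecomputeRehash {

  // Simplified stand-in for BytesRefHash's doHash (Lucene uses murmurhash3_x86_32).
  public static int doHash(byte[] bytes) {
    int h = 0;
    for (byte b : bytes) {
      h = 31 * h + (b & 0xFF);
    }
    return h;
  }

  /**
   * Rebuild an id table of size {@code newSize} (a power of two) for the first
   * {@code count} terms. Pass 1 computes all hashcodes sequentially into a
   * temporary int array; pass 2 places the ids with linear probing.
   */
  public static int[] rehash(byte[][] terms, int count, int newSize) {
    final int newMask = newSize - 1;

    // Pass 1: sequential hash computation over the term bytes (good locality).
    final int[] hashcodes = new int[count];
    for (int i = 0; i < count; i++) {
      hashcodes[i] = doHash(terms[i]);
    }

    // Pass 2: scattered placement into the new table.
    final int[] newIds = new int[newSize];
    Arrays.fill(newIds, -1);
    for (int i = 0; i < count; i++) {
      int slot = hashcodes[i] & newMask;
      while (newIds[slot] != -1) {
        slot = (slot + 1) & newMask; // linear probe on collision
      }
      newIds[slot] = i;
    }
    return newIds;
  }
}
```

The extra memory is exactly the `int[] hashcodes` array (4 bytes per term), and it is released as soon as the rehash finishes.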
## Test code

```java
private static void insert(List<BytesRef> testData, int round) {
  for (int r = 0; r < round; r++) {
    BytesRefHash hash = new BytesRefHash();
    int uniqueCount = 0;
    long start = System.nanoTime();
    for (BytesRef ref : testData) {
      int pos = hash.add(ref);
      if (pos >= 0) {
        uniqueCount += 1;
      }
    }
    long insertTimeNs = System.nanoTime() - start;
    System.out.printf(
        "Inserted %d terms in %.2f ms, unique term %d\n",
        testData.size(), insertTimeNs / 1_000_000.0, uniqueCount);
    System.out.printf(
        "rehashTimes %d, rehashTimeMs %d, calcHashTimeMs %d\n",
        hash.rehashTimes, hash.rehashTimeMs, hash.calcHashTimeMs);
  }
}
```

## BytesRefHash.java (instrumented for timing)

```java
public long rehashTimes = 0L;
public long rehashTimeMs = 0L;
public long calcHashTimeMs = 0L;

public int add(BytesRef bytes) {
  assert bytesStart != null : "Bytesstart is null - not initialized";
  final int hashcode = doHash(bytes.bytes, bytes.offset, bytes.length);
  // final position
  final int hashPos = findHash(bytes, hashcode);
  int e = ids[hashPos];
  if (e == -1) {
    // new entry
    if (count >= bytesStart.length) {
      bytesStart = bytesStartArray.grow();
      assert count < bytesStart.length + 1 : "count: " + count + " len: " + bytesStart.length;
    }
    bytesStart[count] = pool.addBytesRef(bytes);
    e = count++;
    assert ids[hashPos] == -1;
    ids[hashPos] = e | (hashcode & highMask);
    if (count == hashHalfSize) {
      rehashTimes += 1; // <--
      long start = System.nanoTime();
      rehash(2 * hashSize, true);
      rehashTimeMs += (System.nanoTime() - start) / 1_000_000;
    }
    return e;
  }
  return -((e & hashMask) + 1);
}
```

# Test with large amounts (12585302 unique terms)

## Original

With large amounts of data, `rehash` took more than half the time of the entire insert operation.
```
round#32 Inserted 12585302 terms in 2270.57 ms, unique term 12585302
rehashTimes 21, rehashTimeMs 1503, calcHashTimeMs 0
round#33 Inserted 12585302 terms in 2255.78 ms, unique term 12585302
rehashTimes 21, rehashTimeMs 1484, calcHashTimeMs 0
round#34 Inserted 12585302 terms in 2267.66 ms, unique term 12585302
rehashTimes 21, rehashTimeMs 1499, calcHashTimeMs 0
round#35 Inserted 12585302 terms in 2253.70 ms, unique term 12585302
rehashTimes 21, rehashTimeMs 1481, calcHashTimeMs 0
```

## With precompute

```
round#92 Inserted 12585302 terms in 1090.96 ms, unique term 12585302
rehashTimes 21, rehashTimeMs 338, calcHashTimeMs 136
round#93 Inserted 12585302 terms in 1099.85 ms, unique term 12585302  <-- (2255.78 - 1099.85) / 2255.78 = 0.512
rehashTimes 21, rehashTimeMs 342, calcHashTimeMs 135
round#94 Inserted 12585302 terms in 1096.58 ms, unique term 12585302
rehashTimes 21, rehashTimeMs 338, calcHashTimeMs 135
round#95 Inserted 12585302 terms in 1109.92 ms, unique term 12585302
rehashTimes 21, rehashTimeMs 342, calcHashTimeMs 135
```

# Test with medium amounts (915436 of 17088421 unique terms)

## Original

With medium amounts of data, precompute has little effect, but it doesn't make things worse.

```
round#34 Inserted 17088421 terms in 331.01 ms, unique term 915436
rehashTimes 17, rehashTimeMs 31, calcHashTimeMs 0
round#35 Inserted 17088421 terms in 322.89 ms, unique term 915436
rehashTimes 17, rehashTimeMs 29, calcHashTimeMs 0
round#36 Inserted 17088421 terms in 323.09 ms, unique term 915436
rehashTimes 17, rehashTimeMs 30, calcHashTimeMs 0
round#37 Inserted 17088421 terms in 319.32 ms, unique term 915436
rehashTimes 17, rehashTimeMs 29, calcHashTimeMs 0
```

## With precompute

```
round#43 Inserted 17088421 terms in 308.29 ms, unique term 915436
rehashTimes 17, rehashTimeMs 16, calcHashTimeMs 7
round#44 Inserted 17088421 terms in 307.25 ms, unique term 915436  <-- (322.89 - 307.25) / 322.89 = 0.048
rehashTimes 17, rehashTimeMs 16, calcHashTimeMs 7
round#45 Inserted 17088421 terms in 307.91 ms, unique term 915436
rehashTimes 17, rehashTimeMs 16, calcHashTimeMs 7
round#46 Inserted 17088421 terms in 302.33 ms, unique term 915436
rehashTimes 17, rehashTimeMs 16, calcHashTimeMs 7
```

@dweiss @mikemccand please take a look and give some advice

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
