tyronecai opened a new pull request, #15779: URL: https://github.com/apache/lucene/pull/15779
### Description

Rehashing performance is improved by having `rehash` compute the hashcodes of all terms sequentially into a temporary int array. This improves memory locality, which speeds up the `pool.hash` computation, which in turn speeds up `rehash` and ultimately significantly improves `add` performance. The trade-off is that the temporary int array consumes additional memory (4 bytes * count), potentially an extra 40 MB for 10 million terms, which should be acceptable.

I also tested keeping an `int[] hashcodes` field in the `BytesRefHash` class, maintaining it during `add` and reusing it for the results of `findHash`. For large datasets this yields a further slight improvement (5-6%), but the memory overhead and maintenance cost seem uneconomical.

Thanks to OpenAI's Codex, I was able to validate my implementation very quickly, which is fantastic.
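To illustrate the idea, here is a minimal standalone sketch of the two-pass rehash, not the actual patch: `doHash` is a simple polynomial stand-in for Lucene's murmurhash, a `byte[][]` stands in for the byte pool, and plain linear probing stands in for `BytesRefHash`'s real probe sequence. Pass 1 reads the term bytes sequentially (the locality win); pass 2 does the scattered writes into the new table.

```java
import java.util.Arrays;

public class PrecomputeRehash {

  // Simplified stand-in for BytesRefHash's doHash (Lucene uses murmurhash3_x86_32).
  public static int doHash(byte[] bytes) {
    int h = 0;
    for (byte b : bytes) {
      h = 31 * h + (b & 0xFF);
    }
    return h;
  }

  /**
   * Rebuild an id table of size {@code newSize} (a power of two) for the first
   * {@code count} terms. Pass 1 computes all hashcodes sequentially into a
   * temporary int array; pass 2 places the ids with linear probing.
   */
  public static int[] rehash(byte[][] terms, int count, int newSize) {
    final int newMask = newSize - 1;

    // Pass 1: sequential hash computation over the term bytes (good locality).
    final int[] hashcodes = new int[count];
    for (int i = 0; i < count; i++) {
      hashcodes[i] = doHash(terms[i]);
    }

    // Pass 2: scattered placement into the new table.
    final int[] newIds = new int[newSize];
    Arrays.fill(newIds, -1);
    for (int i = 0; i < count; i++) {
      int slot = hashcodes[i] & newMask;
      while (newIds[slot] != -1) {
        slot = (slot + 1) & newMask; // linear probe on collision
      }
      newIds[slot] = i;
    }
    return newIds;
  }
}
```

The extra memory is exactly the `int[] hashcodes` array (4 bytes per term), and it is released as soon as the rehash finishes.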
## Test code

```java
private static void insert(List<BytesRef> testData, int round) {
  for (int r = 0; r < round; r++) {
    BytesRefHash hash = new BytesRefHash();
    int uniqueCount = 0;
    long start = System.nanoTime();
    for (BytesRef ref : testData) {
      int pos = hash.add(ref);
      if (pos >= 0) {
        uniqueCount += 1;
      }
    }
    long insertTimeNs = System.nanoTime() - start;
    System.out.printf(
        "Inserted %d terms in %.2f ms, unique term %d\n",
        testData.size(), insertTimeNs / 1_000_000.0, uniqueCount);
    System.out.printf(
        "rehashTimes %d, rehashTimeMs %d, calcHashTimeMs %d\n",
        hash.rehashTimes, hash.rehashTimeMs, hash.calcHashTimeMs);
  }
}
```

## BytesRefHash.java (instrumented for timing)

```java
public long rehashTimes = 0L;
public long rehashTimeMs = 0L;
public long calcHashTimeMs = 0L;

public int add(BytesRef bytes) {
  assert bytesStart != null : "Bytesstart is null - not initialized";
  final int hashcode = doHash(bytes.bytes, bytes.offset, bytes.length);
  // final position
  final int hashPos = findHash(bytes, hashcode);
  int e = ids[hashPos];
  if (e == -1) {
    // new entry
    if (count >= bytesStart.length) {
      bytesStart = bytesStartArray.grow();
      assert count < bytesStart.length + 1 : "count: " + count + " len: " + bytesStart.length;
    }
    bytesStart[count] = pool.addBytesRef(bytes);
    e = count++;
    assert ids[hashPos] == -1;
    ids[hashPos] = e | (hashcode & highMask);
    if (count == hashHalfSize) {
      rehashTimes += 1; // <--
      long start = System.nanoTime();
      rehash(2 * hashSize, true);
      rehashTimeMs += (System.nanoTime() - start) / 1_000_000;
    }
    return e;
  }
  return -((e & hashMask) + 1);
}
```

# Test with large amounts (12585302 unique terms)

## Original

With large amounts of data, `rehash` took more than half the time of the entire insert operation.
```
round#32 Inserted 12585302 terms in 2270.57 ms, unique term 12585302
rehashTimes 21, rehashTimeMs 1503, calcHashTimeMs 0
round#33 Inserted 12585302 terms in 2255.78 ms, unique term 12585302
rehashTimes 21, rehashTimeMs 1484, calcHashTimeMs 0
round#34 Inserted 12585302 terms in 2267.66 ms, unique term 12585302
rehashTimes 21, rehashTimeMs 1499, calcHashTimeMs 0
round#35 Inserted 12585302 terms in 2253.70 ms, unique term 12585302
rehashTimes 21, rehashTimeMs 1481, calcHashTimeMs 0
```

## With precompute

```
round#92 Inserted 12585302 terms in 1090.96 ms, unique term 12585302
rehashTimes 21, rehashTimeMs 338, calcHashTimeMs 136
round#93 Inserted 12585302 terms in 1099.85 ms, unique term 12585302  <-- (2255.78 - 1099.85) / 2255.78 = 0.512
rehashTimes 21, rehashTimeMs 342, calcHashTimeMs 135
round#94 Inserted 12585302 terms in 1096.58 ms, unique term 12585302
rehashTimes 21, rehashTimeMs 338, calcHashTimeMs 135
round#95 Inserted 12585302 terms in 1109.92 ms, unique term 12585302
rehashTimes 21, rehashTimeMs 342, calcHashTimeMs 135
```

# Test with medium amounts (915436 of 17088421 unique terms)

## Original

With medium amounts of data, precompute has little effect, but it doesn't make things worse.

```
round#34 Inserted 17088421 terms in 331.01 ms, unique term 915436
rehashTimes 17, rehashTimeMs 31, calcHashTimeMs 0
round#35 Inserted 17088421 terms in 322.89 ms, unique term 915436
rehashTimes 17, rehashTimeMs 29, calcHashTimeMs 0
round#36 Inserted 17088421 terms in 323.09 ms, unique term 915436
rehashTimes 17, rehashTimeMs 30, calcHashTimeMs 0
round#37 Inserted 17088421 terms in 319.32 ms, unique term 915436
rehashTimes 17, rehashTimeMs 29, calcHashTimeMs 0
```

## With precompute

```
round#43 Inserted 17088421 terms in 308.29 ms, unique term 915436
rehashTimes 17, rehashTimeMs 16, calcHashTimeMs 7
round#44 Inserted 17088421 terms in 307.25 ms, unique term 915436  <-- (322.89 - 307.25) / 322.89 = 0.048
rehashTimes 17, rehashTimeMs 16, calcHashTimeMs 7
round#45 Inserted 17088421 terms in 307.91 ms, unique term 915436
rehashTimes 17, rehashTimeMs 16, calcHashTimeMs 7
round#46 Inserted 17088421 terms in 302.33 ms, unique term 915436
rehashTimes 17, rehashTimeMs 16, calcHashTimeMs 7
```

@dweiss @mikemccand please take a look and give some advice

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
