mikemccand commented on PR #15779:
URL: https://github.com/apache/lucene/pull/15779#issuecomment-3990897502

   > Thanks to OpenAI's Codex, I am able to validate my implementation very 
quickly, which is fantastic.
   
   +1 -- these genai tools are amazing.  Claude (Opus 4.6) helped me add ASCII-art (well, Unicode) [sparkline histograms to visualize smelly vectors in `luceneutil`'s `knnPerfTest.py` (vector search benchmark)](https://github.com/mikemccand/luceneutil/blob/main/src/python/knnPerfTest.py).
   
   Wow, ~51% net topline (time to insert N keys) speedup by pre-computing all hashes into an up-front `int[]` during rehash!?
   
   You are doing precisely the same amount of CPU work (1X hash computation for each `BytesRef` in the `BytesRefHash`), just doing it up front (this change) vs. doing it interleaved with the inserts into the new, larger hash table?  I.e. we are not somehow saving further `hash()` calls when stepping through collisions on insert.
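
   Here is a tiny self-contained sketch of the two orderings as I understand them (the toy `keyPool`, `hashOf` and linear-probe table are my placeholders, not the PR's actual code):

   ```java
   import java.util.Arrays;

   // Toy contrast of interleaved vs. up-front hashing during a rehash; the names
   // here are stand-ins for BytesRefHash internals, not the real implementation.
   final class RehashOrderingSketch {

     static int hashOf(byte[][] keyPool, int id) {
       return Arrays.hashCode(keyPool[id]);  // stand-in for reading + hashing the packed key
     }

     static void insert(int[] table, int mask, int hash, int id) {
       int slot = hash & mask;
       while (table[slot] != -1) {           // linear probe
         slot = (slot + 1) & mask;
       }
       table[slot] = id;
     }

     // Before: hash each key right as it is re-inserted (pool reads and table writes interleave).
     static int[] rehashInterleaved(byte[][] keyPool, int count, int newSize) {
       int[] table = new int[newSize];
       Arrays.fill(table, -1);
       for (int id = 0; id < count; id++) {
         insert(table, newSize - 1, hashOf(keyPool, id), id);
       }
       return table;
     }

     // This PR, roughly: hash everything into a scratch int[] first (sequential pool reads),
     // then do all the scattered writes into the new table -- same 1X hash work, reordered.
     static int[] rehashUpFront(byte[][] keyPool, int count, int newSize) {
       int[] hashes = new int[count];
       for (int id = 0; id < count; id++) {
         hashes[id] = hashOf(keyPool, id);
       }
       int[] table = new int[newSize];
       Arrays.fill(table, -1);
       for (int id = 0; id < count; id++) {
         insert(table, newSize - 1, hashes[id], id);
       }
       return table;
     }
   }
   ```

   Same total `hashOf` calls either way; only the memory access pattern changes.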
   
   I guess reading the key from a random packed `byte[]` location is cache-painful.  Similarly, writing into the bigger hash table is also cache-painful.  But if you try to do both at once -> cache thrashing (the two fight with each other, greatly reducing the cache hit %).
   
   If you run with `perf stat -ddd` it'll probably show exactly this from its 
counters?
   
   I asked genai to look at the PR and explain the speedup.  [Claude Opus 
4.6](https://claude.ai/share/94566b7c-eccb-425e-bf2f-c74d7e40a0b4) did well.  
So did [Grok 
(Expert)](https://grok.com/share/c2hhcmQtMw_13d1dd8d-985d-48aa-9b94-b02143298a18).
  [Gemini (Thinking) was (surprisingly) not 
great](https://gemini.google.com/share/1b222bf6a692) -- it confusingly thought 
this PR introduced power-of-two hash table sizes (that is pre-existing), and 
that this PR switched from modulo math to bitmasks (also pre-existing).
   
   Anyways, I love this change.  How portable is it?  If you use [Lucene's `aws-jmh` benchmark infra](https://github.com/apache/lucene/tree/main/dev-tools/aws-jmh) across various CPU flavors, do the gains hold up?  I expect on the nightly benchmarking box (`beast3`) this would be a big win -- it has four "chiplets", and inter-chiplet latency is much higher than within-chiplet, so I have to tell the OS which NUMA nodes to use, etc.  Do you have the benchy source you ran?  I'll test on `beast3`.
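
   For reference, here's roughly the shape of micro-benchmark I'd run (my own JMH sketch, not necessarily the harness you used -- it just inserts N random keys into a fresh `BytesRefHash` each invocation so every run pays the same rehash costs):

   ```java
   import java.util.Random;
   import java.util.concurrent.TimeUnit;

   import org.apache.lucene.util.BytesRef;
   import org.apache.lucene.util.BytesRefHash;
   import org.openjdk.jmh.annotations.*;

   // Hypothetical micro-benchmark sketch: time to insert numKeys unique-ish keys
   // into a fresh BytesRefHash, which forces the full cascade of rehashes.
   @BenchmarkMode(Mode.AverageTime)
   @OutputTimeUnit(TimeUnit.MILLISECONDS)
   @State(Scope.Benchmark)
   @Fork(1)
   @Warmup(iterations = 3)
   @Measurement(iterations = 5)
   public class BytesRefHashInsertBench {

     @Param({"100000", "1000000"})
     public int numKeys;

     public BytesRef[] keys;

     @Setup(Level.Trial)
     public void setup() {
       Random r = new Random(42);
       keys = new BytesRef[numKeys];
       for (int i = 0; i < numKeys; i++) {
         byte[] b = new byte[8 + r.nextInt(24)];   // 8..31 random bytes per key
         r.nextBytes(b);
         keys[i] = new BytesRef(b);
       }
     }

     @Benchmark
     public int insertAll() {
       BytesRefHash hash = new BytesRefHash();     // fresh hash per invocation
       int last = 0;
       for (BytesRef key : keys) {
         last = hash.add(key);
       }
       return last;
     }
   }
   ```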
   
   But I'm worried about the surge in transient RAM.  It's especially bad timing because we already surge to 3X the current hash table in transience (1X the current table, 2X for the new one being rehashed into) -- surge on surge.  Could we change it to do that pre-computation in chunks?
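
   Something like this is what I have in mind -- a self-contained sketch of the chunking idea with placeholder structures (again, not the real `BytesRefHash` internals), bounding the transient scratch buffer instead of allocating one int per key:

   ```java
   import java.util.Arrays;

   // Sketch: rehash in fixed-size chunks so the transient hash buffer is capped at
   // CHUNK ints, at the cost of slightly smaller sequential runs over the key pool.
   final class ChunkedRehashSketch {
     static int[] rehashChunked(byte[][] keyPool, int[] ids, int count, int newSize) {
       final int CHUNK = 8192;                       // assumption: tune to cache-ish sizes
       final int[] hashes = new int[Math.min(CHUNK, count)];
       final int mask = newSize - 1;                 // newSize is a power of two
       final int[] newIds = new int[newSize];
       Arrays.fill(newIds, -1);
       for (int start = 0; start < count; start += CHUNK) {
         final int end = Math.min(start + CHUNK, count);
         // pass 1: sequential reads over the packed key pool, hashing into the scratch buffer
         for (int i = start; i < end; i++) {
           hashes[i - start] = Arrays.hashCode(keyPool[ids[i]]);
         }
         // pass 2: scattered writes into the new (larger) table using only the cached hashes
         for (int i = start; i < end; i++) {
           int slot = hashes[i - start] & mask;
           while (newIds[slot] != -1) {              // linear probe
             slot = (slot + 1) & mask;
           }
           newIds[slot] = ids[i];
         }
       }
       return newIds;
     }
   }
   ```

   The scratch buffer stays a constant size no matter how many keys there are, so the "surge on surge" goes away.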
   
   Also: since `BytesRefHash` now uses the otherwise-zero leading bits of each int in the `int[] ids` array to hold some of the hash code bits, couldn't we often avoid recomputing the full hash entirely?  We know the lower bits of the hash (the position in the current `ids` array), and we know more bits from [the recent opto](https://github.com/apache/lucene/commit/2f66c8f6668622c0c82b720d47ccd43b57e32edb) -- don't we have enough bits?  We need just one more bit for the initial hash slot ... on linear probe it just increments...
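
   To make the bit accounting concrete, here's a hypothetical packing sketch (my own toy layout -- the actual `ids[]` format from that commit may differ):

   ```java
   // Toy layout: cache the top HASH_BITS of each key's hash in the otherwise-zero
   // leading bits of its ids[] entry; probes can then reject most mismatches cheaply,
   // and a rehash could in principle reuse these bits instead of re-hashing the bytes.
   final class CachedHashBitsSketch {
     static final int HASH_BITS = 7;                  // assumption: bits cached per entry
     static final int ID_BITS = 32 - HASH_BITS;
     static final int ID_MASK = (1 << ID_BITS) - 1;

     static int pack(int id, int hash) {
       assert id >= 0 && id <= ID_MASK;               // id must fit below the cached bits
       return (hash & ~ID_MASK) | id;                 // keep only the top HASH_BITS of the hash
     }

     static int unpackId(int packed) {
       return packed & ID_MASK;
     }

     static int cachedHighHashBits(int packed) {
       return packed & ~ID_MASK;
     }
   }
   ```

   Whether those cached high bits plus the entry's slot position actually recover the one extra low bit the doubled table needs -- once linear probing has pushed some keys off their home slot -- is the part I'm not sure about.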



