kaivalnp commented on PR #14874:
URL: https://github.com/apache/lucene/pull/14874#issuecomment-3079893943

   > I'll try to dig deeper on why this is happening..
   
   @msokolov I tried what you mentioned [above](https://github.com/apache/lucene/pull/14874#issuecomment-3057200869), using the following hack:
   - Create a clone of [this function](https://github.com/apache/lucene/blob/d8b52ade0caee2e0505eead83bd4d6be859a6472/lucene/core/src/java24/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java#L311)
   - Change [this line](https://github.com/apache/lucene/blob/d8b52ade0caee2e0505eead83bd4d6be859a6472/lucene/core/src/java24/org/apache/lucene/internal/vectorization/Lucene99MemorySegmentByteVectorScorer.java#L115) to use the cloned function
   
   ..and the performance changed drastically!
   
   `main`:
   ```
   recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
    0.961        1.836   1.835        0.999  100000   100      50       64        250         no     11.65       8582.22           24.70             1           77.47       292.969      292.969       HNSW
   ```
   
   after the hack:
   ```
   recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
    0.961        1.201   1.200        0.999  100000   100      50       64        250         no     11.67       8569.71           50.90             1           77.47       292.969      292.969       HNSW
   ```
   
   Note that search time is \~35% faster, while force-merge time is \~106% slower!
   
   Looks like the JIT compiler is specializing `int dotProduct(MemorySegment, MemorySegment)` for the concrete segment type(s) it observes at the call site -- and callers that pass a different segment type fall onto a non-optimized path, suffering a latency regression..
   
   On `main`, the indexing path appears to have won the type profile; after the hack, the search path got its own specialized copy -- which would explain why search sped up while force-merge slowed down
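   To make the suspected mechanism concrete, here is a standalone sketch (hypothetical names, not Lucene code) of the "clone the method" hack: one shared method body is called with both heap-backed and native `MemorySegment`s, so its type profile sees two concrete types, while a verbatim clone used by only one path keeps a monomorphic profile the JIT can specialize. Requires Java 22+ for the finalized FFM API.

   ```java
   import java.lang.foreign.Arena;
   import java.lang.foreign.MemorySegment;
   import java.lang.foreign.ValueLayout;

   public class ProfilePollutionSketch {

       // Shared body: both the "indexing" and "search" paths funnel through this
       // one method, so its type profile sees heap segments AND native segments.
       static int dotProductShared(MemorySegment a, MemorySegment b, int len) {
           int sum = 0;
           for (int i = 0; i < len; i++) {
               sum += a.get(ValueLayout.JAVA_BYTE, i) * b.get(ValueLayout.JAVA_BYTE, i);
           }
           return sum;
       }

       // The "hack": a verbatim clone, so one path gets its own call site and
       // type profile, and can be devirtualized/specialized independently.
       static int dotProductCloned(MemorySegment a, MemorySegment b, int len) {
           int sum = 0;
           for (int i = 0; i < len; i++) {
               sum += a.get(ValueLayout.JAVA_BYTE, i) * b.get(ValueLayout.JAVA_BYTE, i);
           }
           return sum;
       }

       public static void main(String[] args) {
           byte[] v = {1, 2, 3, 4};
           MemorySegment heap = MemorySegment.ofArray(v);    // heap-backed segment
           try (Arena arena = Arena.ofConfined()) {
               MemorySegment nativeSeg = arena.allocate(4);  // native segment
               MemorySegment.copy(heap, 0, nativeSeg, 0, 4);
               // Same bytes, two different concrete segment implementations --
               // the shared method's profile becomes polymorphic:
               System.out.println(dotProductShared(heap, heap, 4));
               System.out.println(dotProductShared(nativeSeg, heap, 4));
               // The clone only ever sees heap segments:
               System.out.println(dotProductCloned(heap, heap, 4));
           }
       }
   }
   ```

   The functional results are identical either way; only the JIT's per-call-site type profiles differ, which is why the hack changes performance without changing behavior.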
   
   I had the following questions:
   1. Is this also an issue in long-running applications, or just a benchmark issue with `luceneutil`?
   2. How can we refactor these functions so that the optimal case is always used? Perhaps separate out the Panama / Vector API usages?
   3. Also, does the issue disproportionately affect applications where indexing and search happen on the same node? (vs. applications with separate writers / searchers, where each internally executes its own optimized branch)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

