[I] Segment count (merging) can impact recall on KNN ParentJoin queries [lucene]

via GitHub Sat, 10 May 2025 17:10:10 -0700


vigyasharma opened a new issue, #14643:
URL: https://github.com/apache/lucene/issues/14643


   I've been running benchmarks on the KNN parent-join query to get comparison 
numbers for multi-vectors (https://github.com/apache/lucene/pull/14173). I see 
a notable difference in recall when merging is disabled on the writer. 
I would have expected latency to be somewhat affected (although the impact here 
seems too high), but not recall. Creating this issue to dig into it.
   
   #### Setup
   1. Both the lucene and luceneutil jars are on the `main` branch.
   2. To disable merges, I configured the writer's merge policy to 
`NoMergePolicy.INSTANCE`. So while we still configure a 
`ConcurrentMergeScheduler`, the merge policy does not find any merges, 
effectively disabling merging. 
   More specifically, I added the following line to `KnnIndexer.java`:
       ```java
       iwc.setMergePolicy(NoMergePolicy.INSTANCE);
       ```
   3. There is no other change between the two setups compared here.
   
   #### Benchmark Results
   ```ruby
   # Parent Join Queries
   # merging enabled
   recall  latency(ms)     nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
    0.228        4.697   500000   100      50       64        250         no    113.17       4418.09             7         1473.92      1464.844     1464.844       HNSW
    0.179        3.043  1000000   100      50       64        250         no    244.78       4085.27             5         2948.15      2929.688     2929.688       HNSW
    0.202        3.735  2000000   100      50       64        250         no    469.05       4263.91             9         5896.90      5859.375     5859.375       HNSW

   # merging disabled: note num_segments value
   recall  latency(ms)     nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
    0.378       13.976   500000   100      50       64        250         no    107.52       4650.43            16         1473.82      1464.844     1464.844       HNSW
    0.415       21.928  1000000   100      50       64        250         no    225.22       4440.12            32         2947.82      2929.688     2929.688       HNSW
    0.466       33.751  2000000   100      50       64        250         no    478.83       4176.83            63         5896.20      5859.375     5859.375       HNSW
   
   ```
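
   As a quick sanity check on the parent-join numbers above, the sketch below derives average documents per segment and the no-merge/merged ratios for recall and latency. It is plain arithmetic on the reported values and uses no Lucene API; the class and method names (`SegmentStats`, `docsPerSegment`, `ratio`) are hypothetical. With merges disabled, segments stay at roughly 31k documents each (500000 / 16 = 31250), which would be consistent with segments being flushed at a fixed RAM buffer size, although that is an assumption and not something the benchmark output confirms.

   ```java
   import java.util.Locale;

   public class SegmentStats {
       /** Average number of documents per segment. */
       static double docsPerSegment(long nDoc, int numSegments) {
           return (double) nDoc / numSegments;
       }

       /** Ratio of a no-merge measurement to the corresponding merged measurement. */
       static double ratio(double noMerge, double merged) {
           return noMerge / merged;
       }

       public static void main(String[] args) {
           // Values copied from the parent-join benchmark tables.
           long[] nDoc = {500_000L, 1_000_000L, 2_000_000L};
           int[] segsMerged = {7, 5, 9};
           int[] segsNoMerge = {16, 32, 63};
           double[] recallMerged = {0.228, 0.179, 0.202};
           double[] recallNoMerge = {0.378, 0.415, 0.466};
           double[] latencyMerged = {4.697, 3.043, 3.735};
           double[] latencyNoMerge = {13.976, 21.928, 33.751};

           for (int i = 0; i < nDoc.length; i++) {
               System.out.printf(
                   Locale.ROOT,
                   "nDoc=%d docs/seg merged=%.0f noMerge=%.0f recall x%.2f latency x%.2f%n",
                   nDoc[i],
                   docsPerSegment(nDoc[i], segsMerged[i]),
                   docsPerSegment(nDoc[i], segsNoMerge[i]),
                   ratio(recallNoMerge[i], recallMerged[i]),
                   ratio(latencyNoMerge[i], latencyMerged[i]));
           }
       }
   }
   ```

   Printing the ratios per index size shows how recall and latency move together as the segment count grows, which is the relationship this issue is about.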
   
   ---
   
   This doesn't look like a problem with regular KNN vector queries; it only 
appears with parent-join query benchmarks.
   ```ruby
   # Regular KnnFloatVectorQuery Benchmarks
   # merging enabled
   recall  latency(ms)     nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
    0.969       23.109   500000   100      50       64        250         no    394.34       1267.93             8         1501.47      1464.844     1464.844       HNSW
    0.916       11.001  1000000   100      50       64        250         no   1869.89        534.79             3         3017.29      2929.688     2929.688       HNSW
    0.951       30.394  2000000   100      50       64        250         no   2756.49        725.56            10         6027.67      5859.375     5859.375       HNSW

   # merging disabled: note num_segments value
   recall  latency(ms)     nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
    0.705       51.087   500000   100      50       64        250         no     95.62       5229.14            89         1489.43      1464.844     1464.844       HNSW
    0.960       90.863  1000000   100      50       64        250         no    192.26       5201.37           175         2980.16      2929.688     2929.688       HNSW
    0.971      178.730  2000000   100      50       64        250         no    396.67       5042.00           346         5962.90      5859.375     5859.375       HNSW
   ```
   
   With merges disabled, recall and latency are comparable to the 
merging-enabled runs if I increase the writer's RAM buffer 
(`setRAMBufferSizeMB`) so that fewer segments are created.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


