kaivalnp commented on PR #12679:
URL: https://github.com/apache/lucene/pull/12679#issuecomment-1766995337

   ### Benchmarks
   
   Using the vector file from 
https://home.apache.org/~sokolov/enwiki-20120502-lines-1k-100d.vec (enwiki 
dataset, unit vectors, 100 dimensions)
   
   The setup was 1M doc vectors in a single HNSW graph with `DOT_PRODUCT` 
similarity, and 10K query vectors
   
   The baseline for the new objective is "all vectors above a score threshold" 
(as opposed to the best-scoring `topK` vectors in the current system) for a 
given query, and it is used to compute recall in all subsequent runs.
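
As a minimal sketch of how such a baseline and the recall against it could be computed (this is not the code used for these runs): the helper names below are hypothetical, scores are assumed to be raw dot products, and the comparison is assumed to be inclusive (`>=`).

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def exact_above_threshold(query, docs, threshold):
    """Brute-force baseline: ids of all docs scoring >= threshold (assumed inclusive)."""
    return {i for i, d in enumerate(docs) if dot(query, d) >= threshold}


def recall(found, expected):
    """Fraction of the exact baseline that an approximate search returned."""
    if not expected:
        return 1.0  # assumption: an empty baseline counts as fully recalled
    return len(found & expected) / len(expected)


# Toy unit vectors, for illustration only.
docs = [(1.0, 0.0), (0.6, 0.8), (0.0, 1.0)]
baseline = exact_above_threshold((1.0, 0.0), docs, 0.5)
```

Here `baseline` contains docs 0 and 1, so an approximate search that returned only doc 0 would have a recall of 0.5.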
   
   Here are some statistics for the result counts in the new baseline:
   
   | threshold | mean     | stdDev    | min | p25 | p50  | p75    | p90    | p99    | max    |
   | --------- | -------- | --------- | --- | --- | ---- | ------ | ------ | ------ | ------ |
   | 0.95      | 71877.73 | 109177.23 | 0   | 222 | 7436 | 116567 | 259135 | 388113 | 483330 |
   | 0.96      | 32155.63 | 57183.83  | 0   | 30  | 3524 | 36143  | 120700 | 235038 | 342959 |
   | 0.97      | 8865.48  | 19006.24  | 0   | 1   | 816  | 5483   | 29966  | 92433  | 174163 |
   | 0.98      | 1010.10  | 2423.03   | 0   | 0   | 46   | 873    | 3234   | 12175  | 40163  |
   | 0.99      | 136.47   | 465.91    | 0   | 0   | 0    | 2      | 77     | 2296   | 2494   |
   
   These statistics give an estimate of the per-query result count 
distribution for various `threshold` values, and help gauge the corresponding 
`topK` to use for comparison with the new radius-based vector search API.
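
A summary of this shape can be produced from the per-query result counts roughly as follows. This is only a sketch: the function names are hypothetical, and the nearest-rank percentile and population standard deviation are assumptions, since the method behind the table is not stated.

```python
import math
import statistics


def percentile(ordered, p):
    """Nearest-rank percentile of an already sorted list (assumed method)."""
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(counts):
    """Summary statistics over per-query result counts."""
    ordered = sorted(counts)
    summary = {
        "mean": statistics.mean(ordered),
        "stdDev": statistics.pstdev(ordered),  # assumption: population std dev
        "min": ordered[0],
        "max": ordered[-1],
    }
    for p in (25, 50, 75, 90, 99):
        summary[f"p{p}"] = percentile(ordered, p)
    return summary


# Made-up counts for five queries, for illustration only.
stats = summarize([0, 0, 2, 10, 100])
```

With a skewed sample like this one, the mean (22.4) sits far above the median (2), the same pattern the table shows at `threshold = 0.99`.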
   
   Here we benchmark the new API against K-NN search with a high `topK`, 
filtering out results below the threshold after the HNSW search.
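
The "high `topK` plus post-filtering" approach can be sketched as below. This is a toy stand-in, not the benchmark code: a brute-force `heapq.nlargest` replaces the HNSW search, the function names are hypothetical, and scores are assumed to be raw dot products.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_then_filter(query, docs, k, threshold):
    """Take the k best-scoring docs, then drop those scoring below threshold."""
    scored = ((dot(query, d), i) for i, d in enumerate(docs))
    best = heapq.nlargest(k, scored)  # brute-force stand-in for HNSW search
    return [(i, s) for s, i in best if s >= threshold]


# Toy unit vectors, for illustration only.
docs = [(1.0, 0.0), (0.6, 0.8), (0.0, 1.0), (0.8, 0.6)]
query = (1.0, 0.0)
small_k = [i for i, _ in top_k_then_filter(query, docs, 2, 0.5)]
large_k = [i for i, _ in top_k_then_filter(query, docs, 4, 0.5)]
```

Three docs score at least 0.5 here, so `k = 2` misses one of them while `k = 4` finds all three. Recall under this approach is capped by `topK` whenever more than `topK` vectors lie above the threshold.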
   
   ### K-NN Search (current system)
   
   | maxConn | beamWidth | topK | threshold | mean   | numVisited | latency | recall |
   | ------- | --------- | ---- | --------- | ------ | ---------- | ------- | ------ |
   | 16      | 100       | 500  | 0.99      | 46.39  | 4086       | 1.465   | 0.34   |
   | 16      | 100       | 1000 | 0.99      | 83.92  | 6890       | 2.600   | 0.61   |
   | 16      | 100       | 2000 | 0.99      | 129.56 | 11727      | 4.746   | 0.95   |
   | 16      | 200       | 500  | 0.99      | 46.39  | 4504       | 1.535   | 0.34   |
   | 16      | 200       | 1000 | 0.99      | 83.92  | 7564       | 2.759   | 0.61   |
   | 16      | 200       | 2000 | 0.99      | 129.56 | 12805      | 5.007   | 0.95   |
   | 32      | 100       | 500  | 0.99      | 46.39  | 4940       | 1.644   | 0.34   |
   | 32      | 100       | 1000 | 0.99      | 83.92  | 8271       | 2.944   | 0.61   |
   | 32      | 100       | 2000 | 0.99      | 129.56 | 13937      | 5.335   | 0.95   |
   | 32      | 200       | 500  | 0.99      | 46.39  | 5654       | 1.890   | 0.34   |
   | 32      | 200       | 1000 | 0.99      | 83.92  | 9401       | 3.320   | 0.61   |
   | 32      | 200       | 2000 | 0.99      | 129.56 | 15707      | 5.987   | 0.95   |
   | 64      | 100       | 500  | 0.99      | 46.39  | 5241       | 1.736   | 0.34   |
   | 64      | 100       | 1000 | 0.99      | 83.92  | 8766       | 3.091   | 0.61   |
   | 64      | 100       | 2000 | 0.99      | 129.56 | 14736      | 5.567   | 0.95   |
   | 64      | 200       | 500  | 0.99      | 46.39  | 6095       | 1.992   | 0.34   |
   | 64      | 200       | 1000 | 0.99      | 83.92  | 10119      | 3.535   | 0.61   |
   | 64      | 200       | 2000 | 0.99      | 129.56 | 16852      | 6.365   | 0.95   |
   
   ### R-NN Search (new system)
   
   | maxConn | beamWidth | traversalThreshold | threshold | mean   | numVisited | latency | recall |
   | ------- | --------- | ------------------ | --------- | ------ | ---------- | ------- | ------ |
   | 16      | 100       | 0.99               | 0.99      | 94.03  | 256        | 0.129   | 0.69   |
   | 16      | 100       | 0.98               | 0.99      | 95.18  | 5171       | 2.062   | 0.70   |
   | 16      | 200       | 0.99               | 0.99      | 89.96  | 263        | 0.119   | 0.66   |
   | 16      | 200       | 0.98               | 0.99      | 91.09  | 5497       | 2.207   | 0.67   |
   | 32      | 100       | 0.99               | 0.99      | 109.17 | 295        | 0.135   | 0.80   |
   | 32      | 100       | 0.98               | 0.99      | 110.89 | 6529       | 2.580   | 0.81   |
   | 32      | 200       | 0.99               | 0.99      | 108.97 | 313        | 0.142   | 0.80   |
   | 32      | 200       | 0.98               | 0.99      | 110.55 | 7145       | 2.861   | 0.81   |
   | 64      | 100       | 0.99               | 0.99      | 133.61 | 314        | 0.152   | 0.98   |
   | 64      | 100       | 0.98               | 0.99      | 135.74 | 7033       | 2.765   | 0.99   |
   | 64      | 200       | 0.99               | 0.99      | 133.84 | 333        | 0.163   | 0.98   |
   | 64      | 200       | 0.98               | 0.99      | 135.96 | 7833       | 3.121   | 1.00   |
   
   - `mean` is the average number of results above the `threshold`
   - `numVisited` is the average number of HNSW nodes visited per-query
   - The latency is measured in `ms` per-query
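
As a sanity check on how `mean` and `recall` relate: the reported recall appears to equal `mean` divided by the baseline mean at `threshold = 0.99` (136.47), rounded to two decimals. This is an inference from the numbers above, not something stated about the benchmark code, and the variable and function names are hypothetical.

```python
BASELINE_MEAN = 136.47  # mean result count at threshold 0.99 in the baseline table

# (mean, reported recall) pairs copied from the K-NN and R-NN tables.
rows = [
    (46.39, 0.34), (83.92, 0.61), (129.56, 0.95),   # K-NN, topK = 500 / 1000 / 2000
    (94.03, 0.69), (95.18, 0.70),                   # R-NN, maxConn 16, beamWidth 100
    (89.96, 0.66), (91.09, 0.67),                   # R-NN, maxConn 16, beamWidth 200
    (109.17, 0.80), (110.89, 0.81),                 # R-NN, maxConn 32, beamWidth 100
    (108.97, 0.80), (110.55, 0.81),                 # R-NN, maxConn 32, beamWidth 200
    (133.61, 0.98), (135.74, 0.99),                 # R-NN, maxConn 64, beamWidth 100
    (133.84, 0.98), (135.96, 1.00),                 # R-NN, maxConn 64, beamWidth 200
]


def implied_recall(mean_results):
    """Recall implied by the average result count relative to the baseline mean."""
    return round(mean_results / BASELINE_MEAN, 2)


consistent = all(abs(implied_recall(m) - r) < 1e-9 for m, r in rows)
```

`consistent` evaluates to `True` for every row, which suggests recall is aggregated over all queries (total results found over total results expected) rather than averaged per query.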
   
   **IF** the goal is to "get all vectors within a radius", then it looks like 
using the new radius-based search API scales better than having a large `topK` 
and post-filtering results later?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

