kaivalnp commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1766995337
### Benchmarks

Using the vector file from https://home.apache.org/~sokolov/enwiki-20120502-lines-1k-100d.vec (enwiki dataset, unit vectors, 100 dimensions)

The setup was 1M doc vectors in a single HNSW graph with `DOT_PRODUCT` similarity, and 10K query vectors

The baseline for the new objective is "all vectors above a score threshold" (as opposed to the best-scoring `topK` vectors in the current system) for a given query, and it is used to compute recall in all subsequent runs. Here are some statistics for the result counts in the new baseline:

| threshold | mean | stdDev | min | p25 | p50 | p75 | p90 | p99 | max |
| --------- | -------- | --------- | --- | --- | ---- | ------ | ------ | ------ | ------ |
| 0.95 | 71877.73 | 109177.23 | 0 | 222 | 7436 | 116567 | 259135 | 388113 | 483330 |
| 0.96 | 32155.63 | 57183.83 | 0 | 30 | 3524 | 36143 | 120700 | 235038 | 342959 |
| 0.97 | 8865.48 | 19006.24 | 0 | 1 | 816 | 5483 | 29966 | 92433 | 174163 |
| 0.98 | 1010.10 | 2423.03 | 0 | 0 | 46 | 873 | 3234 | 12175 | 40163 |
| 0.99 | 136.47 | 465.91 | 0 | 0 | 0 | 2 | 77 | 2296 | 2494 |

This is used to get an estimate of the per-query result count distribution for various `threshold` values, and also to gauge the corresponding `topK` to use for comparison with the new radius-based vector search API

Here we will benchmark the new API against a high `topK` (+ filtering out results below the threshold after HNSW search)

### K-NN Search (current system)

| maxConn | beamWidth | topK | threshold | mean | numVisited | latency | recall |
| ------- | --------- | ---- | --------- | ------ | ---------- | ------- | ------ |
| 16 | 100 | 500 | 0.99 | 46.39 | 4086 | 1.465 | 0.34 |
| 16 | 100 | 1000 | 0.99 | 83.92 | 6890 | 2.600 | 0.61 |
| 16 | 100 | 2000 | 0.99 | 129.56 | 11727 | 4.746 | 0.95 |
| 16 | 200 | 500 | 0.99 | 46.39 | 4504 | 1.535 | 0.34 |
| 16 | 200 | 1000 | 0.99 | 83.92 | 7564 | 2.759 | 0.61 |
| 16 | 200 | 2000 | 0.99 | 129.56 | 12805 | 5.007 | 0.95 |
| 32 | 100 | 500 | 0.99 | 46.39 | 4940 | 1.644 | 0.34 |
| 32 | 100 | 1000 | 0.99 | 83.92 | 8271 | 2.944 | 0.61 |
| 32 | 100 | 2000 | 0.99 | 129.56 | 13937 | 5.335 | 0.95 |
| 32 | 200 | 500 | 0.99 | 46.39 | 5654 | 1.890 | 0.34 |
| 32 | 200 | 1000 | 0.99 | 83.92 | 9401 | 3.320 | 0.61 |
| 32 | 200 | 2000 | 0.99 | 129.56 | 15707 | 5.987 | 0.95 |
| 64 | 100 | 500 | 0.99 | 46.39 | 5241 | 1.736 | 0.34 |
| 64 | 100 | 1000 | 0.99 | 83.92 | 8766 | 3.091 | 0.61 |
| 64 | 100 | 2000 | 0.99 | 129.56 | 14736 | 5.567 | 0.95 |
| 64 | 200 | 500 | 0.99 | 46.39 | 6095 | 1.992 | 0.34 |
| 64 | 200 | 1000 | 0.99 | 83.92 | 10119 | 3.535 | 0.61 |
| 64 | 200 | 2000 | 0.99 | 129.56 | 16852 | 6.365 | 0.95 |

### R-NN Search (new system)

| maxConn | beamWidth | traversalThreshold | threshold | mean | numVisited | latency | recall |
| ------- | --------- | ------------------ | --------- | ------ | ---------- | ------- | ------ |
| 16 | 100 | 0.99 | 0.99 | 94.03 | 256 | 0.129 | 0.69 |
| 16 | 100 | 0.98 | 0.99 | 95.18 | 5171 | 2.062 | 0.70 |
| 16 | 200 | 0.99 | 0.99 | 89.96 | 263 | 0.119 | 0.66 |
| 16 | 200 | 0.98 | 0.99 | 91.09 | 5497 | 2.207 | 0.67 |
| 32 | 100 | 0.99 | 0.99 | 109.17 | 295 | 0.135 | 0.80 |
| 32 | 100 | 0.98 | 0.99 | 110.89 | 6529 | 2.580 | 0.81 |
| 32 | 200 | 0.99 | 0.99 | 108.97 | 313 | 0.142 | 0.80 |
| 32 | 200 | 0.98 | 0.99 | 110.55 | 7145 | 2.861 | 0.81 |
| 64 | 100 | 0.99 | 0.99 | 133.61 | 314 | 0.152 | 0.98 |
| 64 | 100 | 0.98 | 0.99 | 135.74 | 7033 | 2.765 | 0.99 |
| 64 | 200 | 0.99 | 0.99 | 133.84 | 333 | 0.163 | 0.98 |
| 64 | 200 | 0.98 | 0.99 | 135.96 | 7833 | 3.121 | 1.00 |

- `mean` is the average number of results above the `threshold`
- `numVisited` is the average number of HNSW nodes visited per-query
- The latency is measured in `ms` per-query

**IF** the goal is to "get all vectors within a radius", then it looks like using the new radius-based search API scales better than having a large `topK` and post-filtering results later?
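
For readers who want to reproduce the recall metric on a toy scale, here is a minimal sketch of the "all vectors above a score threshold" baseline and of the "`topK` + post-filter" comparison. It is brute force over plain Python lists, not the Lucene API and not the harness used for the numbers above; every function name is hypothetical, and treating an empty baseline as recall 1.0 is an assumption of this sketch (the comment does not say how such queries were handled).

```python
import heapq
import math
import random


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def exact_above_threshold(docs, query, threshold):
    """Baseline: ids of every doc whose score is >= threshold (brute force)."""
    return {i for i, d in enumerate(docs) if dot(d, query) >= threshold}


def exact_top_k_then_filter(docs, query, k, threshold):
    """Exact top-k by score, then drop hits below the threshold (post-filter)."""
    top = heapq.nlargest(k, range(len(docs)), key=lambda i: dot(docs[i], query))
    return {i for i in top if dot(docs[i], query) >= threshold}


def recall(found, baseline):
    """Fraction of baseline ids that were found.

    Assumption of this sketch: an empty baseline counts as recall 1.0.
    """
    return 1.0 if not baseline else len(found & baseline) / len(baseline)


if __name__ == "__main__":
    random.seed(0)
    docs = [normalize([random.gauss(0, 1) for _ in range(8)]) for _ in range(2000)]
    query = normalize([random.gauss(0, 1) for _ in range(8)])
    threshold = 0.8

    baseline = exact_above_threshold(docs, query, threshold)
    for k in (5, 50, 500):
        hits = exact_top_k_then_filter(docs, query, k, threshold)
        print(k, len(hits), round(recall(hits, baseline), 2))
```

Even with exact search, a fixed `k` caps the number of results per query, so any query with more than `k` vectors above the threshold cannot reach full recall. That is one plausible reading of why recall in the K-NN table only climbs as `topK` grows.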