kaivalnp commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1809743395
### Benchmark Setup Sharing my benchmark setup for reproducibility in [this branch](https://github.com/kaivalnp/lucene/tree/similarity-benchmark) (see [this commit](https://github.com/apache/lucene/commit/41610c734308505c87018f2da3343a489a7f7a59) on top of changes in the PR) *Consideration*: Being a similarity-based search, we need not maintain the actual results above a `resultSimilarity` (because any hit above this threshold is implicitly correct, and in the "golden set / baseline"). This makes recall computations easier: simply maintain per-query average count of results above the `resultSimilarity`, and a ratio of candidate / baseline gives the "recall" Added a utility class with some command line arguments: - `--vecPath`: path to VEC file with vectors arranged one after another, in `LITTLE_ENDIAN` byte order (example https://home.apache.org/~sokolov/enwiki-20120502-lines-1k-100d.vec) - `--dim`: dimension of vectors, default `100` from above file - `--numDocs`: number of vectors to index, default `1M` - `--numQueries`: number of vectors to search for, default `10K` - `--indexPath`: a parent directory for storing multiple indexes (of `maxConn` - `beamWidth` combinations. Also takes `numDocs`, `knnField`, `function` and `maxNumSegments` into account for naming) - `--knnField`: name of KNN field, default `knn` - `--function`: `VectorSimilarityFunction` to use (one of `[EUCLIDEAN, DOT_PRODUCT, COSINE, MAXIMUM_INNER_PRODUCT]`), default `DOT_PRODUCT` - `--maxNumSegments`: number of segments to force merge into, default `1` to mitigate noise of parallelism - `--maxConns`: list of `maxConn` values to try, default `{16, 32, 64}` - `--beamWidths`: list of `beamWidth` values to try, default `{100, 200}` - `--topKs`: list of `topK` values to try in the KNN-based search, default to none - `--topK-thresholds`: list of `threshold` values to post-filter in the KNN-based search, default to none. 1-1 correspondence with `topKs` parameter (must be of same size) - `--traversalSimilarities`: list of `traversalSimilarity` values to try in the similarity-based search, default to none - `--resultSimilarities`: list of `resultSimilarity` values to try in the similarity-based search, default to none. 1-1 correspondence with `traversalSimilarities` parameter (must be of same size) The script takes the first `numDocs` vectors and creates multiple indexes in the `indexPath` parent directory with `dim-numDocs-knnField-function-maxNumSegments-maxConn-beamWidth` folder name. It takes the next `numQueries` vectors as queries It goes over all `topKs` - `topK-thresholds` and performs KNN-based search, also computing and caching the "true" results for recall (to be reused from similarity-based search). Uses a `KnnFloatVectorQuery`, but removes results below the `topK-threshold` after graph search. The idea is to simulate a high `topK` + post-filtering It then goes over all `traversalSimilarities` - `resultSimilarities` and performs similarity-based search, re-using or computing the "true" results for recall Finally prints both results in a formatted table (shown below) The script can be executed using `./gradlew :lucene:core:similarity-benchmark`. It makes use of parallelism wherever possible to speed things up (each query on a different thread) - *please set the desired concurrency based on your machine*. The latency reporting takes multi-threading into account and averages times for each search in a single thread (even when multiple are running in parallel and wall-clock time may be lower) Here is a sample command on my machine, and corresponding output (took \~30 min *after all indexes were created*): ```sh ./gradlew :lucene:core:similarity-benchmark -Djava.util.concurrent.ForkJoinPool.common.parallelism=64 --args="--vecPath=/home/kaivalnp/working/similarity-benchmark/docs.vec --indexPath=/home/kaivalnp/working/similarity-benchmark/indexes --topKs=500,1000,2000 --topK-thresholds=0.99,0.99,0.99 --traversalSimilarities=0.98,0.99 --resultSimilarities=0.99,0.99" ``` ### KNN search | maxConn | beamWidth | topK | threshold | count | numVisited | latency | recall | | ------- | --------- | ---- | --------- | ------ | ---------- | ------- | ------ | | 16 | 100 | 500 | 0.99 | 46.39 | 4086.90 | 2.58 | 0.34 | | 16 | 100 | 1000 | 0.99 | 83.92 | 6890.43 | 4.50 | 0.61 | | 16 | 100 | 2000 | 0.99 | 129.56 | 11727.72 | 8.09 | 0.95 | | 16 | 200 | 500 | 0.99 | 46.39 | 4504.86 | 2.72 | 0.34 | | 16 | 200 | 1000 | 0.99 | 83.92 | 7564.04 | 4.89 | 0.61 | | 16 | 200 | 2000 | 0.99 | 129.56 | 12805.53 | 8.86 | 0.95 | | 32 | 100 | 500 | 0.99 | 46.39 | 4940.79 | 2.94 | 0.34 | | 32 | 100 | 1000 | 0.99 | 83.92 | 8271.67 | 5.26 | 0.61 | | 32 | 100 | 2000 | 0.99 | 129.56 | 13937.63 | 9.40 | 0.95 | | 32 | 200 | 500 | 0.99 | 46.39 | 5654.39 | 3.35 | 0.34 | | 32 | 200 | 1000 | 0.99 | 83.92 | 9401.15 | 6.02 | 0.61 | | 32 | 200 | 2000 | 0.99 | 129.56 | 15707.08 | 10.71 | 0.95 | | 64 | 100 | 500 | 0.99 | 46.39 | 5241.29 | 3.09 | 0.34 | | 64 | 100 | 1000 | 0.99 | 83.92 | 8766.76 | 5.53 | 0.61 | | 64 | 100 | 2000 | 0.99 | 129.56 | 14736.85 | 9.86 | 0.95 | | 64 | 200 | 500 | 0.99 | 46.40 | 6095.06 | 3.57 | 0.34 | | 64 | 200 | 1000 | 0.99 | 83.92 | 10119.29 | 6.36 | 0.61 | | 64 | 200 | 2000 | 0.99 | 129.56 | 16852.31 | 11.27 | 0.95 | ### Similarity-based search | maxConn | beamWidth | traversalSimilarity | resultSimilarity | count | numVisited | latency | recall | | ------- | --------- | ------------------- | ---------------- | ------ | ---------- | ------- | ------ | | 16 | 100 | 0.98 | 0.99 | 95.18 | 5171.94 | 3.62 | 0.70 | | 16 | 100 | 0.99 | 0.99 | 94.03 | 256.82 | 0.22 | 0.69 | | 16 | 200 | 0.98 | 0.99 | 91.09 | 5497.00 | 3.88 | 0.67 | | 16 | 200 | 0.99 | 0.99 | 89.96 | 263.99 | 0.23 | 0.66 | | 32 | 100 | 0.98 | 0.99 | 110.89 | 6529.97 | 4.52 | 0.81 | | 32 | 100 | 0.99 | 0.99 | 109.17 | 295.06 | 0.26 | 0.80 | | 32 | 200 | 0.98 | 0.99 | 110.55 | 7145.58 | 5.04 | 0.81 | | 32 | 200 | 0.99 | 0.99 | 108.97 | 313.19 | 0.27 | 0.80 | | 64 | 100 | 0.98 | 0.99 | 135.74 | 7033.33 | 4.87 | 0.99 | | 64 | 100 | 0.99 | 0.99 | 133.61 | 314.55 | 0.30 | 0.98 | | 64 | 200 | 0.98 | 0.99 | 135.96 | 7833.15 | 5.52 | 1.00 | | 64 | 200 | 0.99 | 0.99 | 133.84 | 333.15 | 0.31 | 0.98 | The `count`, `numVisited`, `recall` are exactly same as before. There is a difference in baseline of `latency` because of running a different machine, but the ratio is about the same.. Please let me know if you're able to reproduce it / run it on a different vector file @benwtrent.. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org