Re: [PR] Add support for similarity-based vector searches [lucene]

via GitHub Tue, 14 Nov 2023 00:30:29 -0800


kaivalnp commented on PR #12679:
URL: https://github.com/apache/lucene/pull/12679#issuecomment-1809743395


   ### Benchmark Setup
   
   Sharing my benchmark setup for reproducibility in [this 
branch](https://github.com/kaivalnp/lucene/tree/similarity-benchmark) (see 
[this 
commit](https://github.com/apache/lucene/commit/41610c734308505c87018f2da3343a489a7f7a59)
 on top of changes in the PR)
   
   *Consideration*: Being a similarity-based search, we need not maintain the 
actual results above a `resultSimilarity` (because any hit above this threshold 
is implicitly correct, and in the "golden set / baseline"). This makes recall 
computations easier: simply maintain per-query average count of results above 
the `resultSimilarity`, and a ratio of candidate / baseline gives the "recall"
   
   Added a utility class with some command line arguments:
   - `--vecPath`: path to VEC file with vectors arranged one after another, in 
`LITTLE_ENDIAN` byte order (example 
https://home.apache.org/~sokolov/enwiki-20120502-lines-1k-100d.vec)
   - `--dim`: dimension of vectors, default `100` from above file
   - `--numDocs`: number of vectors to index, default `1M`
   - `--numQueries`: number of vectors to search for, default `10K`
   - `--indexPath`: a parent directory for storing multiple indexes (of 
`maxConn` - `beamWidth` combinations. Also takes `numDocs`, `knnField`, 
`function` and `maxNumSegments` into account for naming)
   - `--knnField`: name of KNN field, default `knn`
   - `--function`: `VectorSimilarityFunction` to use (one of `[EUCLIDEAN, 
DOT_PRODUCT, COSINE, MAXIMUM_INNER_PRODUCT]`), default `DOT_PRODUCT`
   - `--maxNumSegments`: number of segments to force merge into, default `1` to 
mitigate noise of parallelism
   - `--maxConns`: list of `maxConn` values to try, default `{16, 32, 64}`
   - `--beamWidths`: list of `beamWidth` values to try, default `{100, 200}`
   - `--topKs`: list of `topK` values to try in the KNN-based search, default 
to none
   - `--topK-thresholds`: list of `threshold` values to post-filter in the 
KNN-based search, default to none. 1-1 correspondence with `topKs` parameter 
(must be of same size)
   - `--traversalSimilarities`: list of `traversalSimilarity` values to try in 
the similarity-based search, default to none
   - `--resultSimilarities`: list of `resultSimilarity` values to try in the 
similarity-based search, default to none. 1-1 correspondence with 
`traversalSimilarities` parameter (must be of same size)
   
   The script takes the first `numDocs` vectors and creates multiple indexes in 
the `indexPath` parent directory with 
`dim-numDocs-knnField-function-maxNumSegments-maxConn-beamWidth` folder name. 
It takes the next `numQueries` vectors as queries
   
   It goes over all `topKs` - `topK-thresholds` and performs KNN-based search, 
also computing and caching the "true" results for recall (to be reused from 
similarity-based search). Uses a `KnnFloatVectorQuery`, but removes results 
below the `topK-threshold` after graph search. The idea is to simulate a high 
`topK` + post-filtering
   
   It then goes over all `traversalSimilarities` - `resultSimilarities` and 
performs similarity-based search, re-using or computing the "true" results for 
recall
   
   Finally prints both results in a formatted table (shown below)
   
   The script can be executed using `./gradlew 
:lucene:core:similarity-benchmark`. It makes use of parallelism wherever 
possible to speed things up (each query on a different thread) - *please set 
the desired concurrency based on your machine*. The latency reporting takes 
multi-threading into account and averages times for each search in a single 
thread (even when multiple are running in parallel and wall-clock time may be 
lower)
   
   Here is a sample command on my machine, and corresponding output (took \~30 
min *after all indexes were created*):
   
   ```sh
   ./gradlew :lucene:core:similarity-benchmark 
-Djava.util.concurrent.ForkJoinPool.common.parallelism=64 
--args="--vecPath=/home/kaivalnp/working/similarity-benchmark/docs.vec 
--indexPath=/home/kaivalnp/working/similarity-benchmark/indexes 
--topKs=500,1000,2000 --topK-thresholds=0.99,0.99,0.99 
--traversalSimilarities=0.98,0.99 --resultSimilarities=0.99,0.99"
   ```
   
   ### KNN search
   
   | maxConn | beamWidth | topK | threshold | count  | numVisited | latency | 
recall |
   | ------- | --------- | ---- | --------- | ------ | ---------- | ------- | 
------ |
   | 16      | 100       | 500  | 0.99      | 46.39  | 4086.90    | 2.58    | 
0.34   |
   | 16      | 100       | 1000 | 0.99      | 83.92  | 6890.43    | 4.50    | 
0.61   |
   | 16      | 100       | 2000 | 0.99      | 129.56 | 11727.72   | 8.09    | 
0.95   |
   | 16      | 200       | 500  | 0.99      | 46.39  | 4504.86    | 2.72    | 
0.34   |
   | 16      | 200       | 1000 | 0.99      | 83.92  | 7564.04    | 4.89    | 
0.61   |
   | 16      | 200       | 2000 | 0.99      | 129.56 | 12805.53   | 8.86    | 
0.95   |
   | 32      | 100       | 500  | 0.99      | 46.39  | 4940.79    | 2.94    | 
0.34   |
   | 32      | 100       | 1000 | 0.99      | 83.92  | 8271.67    | 5.26    | 
0.61   |
   | 32      | 100       | 2000 | 0.99      | 129.56 | 13937.63   | 9.40    | 
0.95   |
   | 32      | 200       | 500  | 0.99      | 46.39  | 5654.39    | 3.35    | 
0.34   |
   | 32      | 200       | 1000 | 0.99      | 83.92  | 9401.15    | 6.02    | 
0.61   |
   | 32      | 200       | 2000 | 0.99      | 129.56 | 15707.08   | 10.71   | 
0.95   |
   | 64      | 100       | 500  | 0.99      | 46.39  | 5241.29    | 3.09    | 
0.34   |
   | 64      | 100       | 1000 | 0.99      | 83.92  | 8766.76    | 5.53    | 
0.61   |
   | 64      | 100       | 2000 | 0.99      | 129.56 | 14736.85   | 9.86    | 
0.95   |
   | 64      | 200       | 500  | 0.99      | 46.40  | 6095.06    | 3.57    | 
0.34   |
   | 64      | 200       | 1000 | 0.99      | 83.92  | 10119.29   | 6.36    | 
0.61   |
   | 64      | 200       | 2000 | 0.99      | 129.56 | 16852.31   | 11.27   | 
0.95   |
   
   ### Similarity-based search
   
   | maxConn | beamWidth | traversalSimilarity | resultSimilarity | count  | 
numVisited | latency | recall |
   | ------- | --------- | ------------------- | ---------------- | ------ | 
---------- | ------- | ------ |
   | 16      | 100       | 0.98                | 0.99             | 95.18  | 
5171.94    | 3.62    | 0.70   |
   | 16      | 100       | 0.99                | 0.99             | 94.03  | 
256.82     | 0.22    | 0.69   |
   | 16      | 200       | 0.98                | 0.99             | 91.09  | 
5497.00    | 3.88    | 0.67   |
   | 16      | 200       | 0.99                | 0.99             | 89.96  | 
263.99     | 0.23    | 0.66   |
   | 32      | 100       | 0.98                | 0.99             | 110.89 | 
6529.97    | 4.52    | 0.81   |
   | 32      | 100       | 0.99                | 0.99             | 109.17 | 
295.06     | 0.26    | 0.80   |
   | 32      | 200       | 0.98                | 0.99             | 110.55 | 
7145.58    | 5.04    | 0.81   |
   | 32      | 200       | 0.99                | 0.99             | 108.97 | 
313.19     | 0.27    | 0.80   |
   | 64      | 100       | 0.98                | 0.99             | 135.74 | 
7033.33    | 4.87    | 0.99   |
   | 64      | 100       | 0.99                | 0.99             | 133.61 | 
314.55     | 0.30    | 0.98   |
   | 64      | 200       | 0.98                | 0.99             | 135.96 | 
7833.15    | 5.52    | 1.00   |
   | 64      | 200       | 0.99                | 0.99             | 133.84 | 
333.15     | 0.31    | 0.98   |
   
   The `count`, `numVisited`, `recall` are exactly same as before. There is a 
difference in baseline of `latency` because of running a different machine, but 
the ratio is about the same..
   
   Please let me know if you're able to reproduce it / run it on a different 
vector file @benwtrent..
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Add support for similarity-based vector searches [lucene]

Reply via email to