kaivalnp commented on PR #15440:
URL: https://github.com/apache/lucene/pull/15440#issuecomment-3555702820
### Notes
- Right now this is a crude implementation, rough and inefficient, only for
demonstration purposes!
- Basically a copy of `Lucene99FlatVectors*`, except that raw vectors are
de-duped and written according to the new layout^ during `flush`
- Additionally, an `ord -> position of vector` mapping is stored and
used during searching
- Does not support an index sort yet
- Does not support merging yet
  - This is mainly an API challenge, because vector merging is expected
to be field-by-field, but it seems doable with a new `finishMerge` API that
does the equivalent of `flush`?
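As a rough illustration of the de-dup plus `ord -> position of vector` idea described above, here is a minimal in-memory sketch. It is not the code in this PR: the class name `DedupVectorsSketch`, its fields, and its methods are all hypothetical, and it uses Java collections where the real format would write to index files during `flush`.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not the PR's code): store each distinct vector once, plus an ord -> position map. */
public class DedupVectorsSketch {
  /** Distinct vectors in first-seen order; stands in for the raw vector data. */
  final List<float[]> uniqueVectors = new ArrayList<>();
  /** positions.get(ord) is the index into uniqueVectors for that ordinal. */
  final List<Integer> positions = new ArrayList<>();
  /** Vector bytes -> position; only needed while writing. */
  private final Map<ByteBuffer, Integer> seen = new HashMap<>();

  /** Adds a vector, re-using the stored copy if an identical one exists, and returns its ordinal. */
  int add(float[] vector) {
    // ByteBuffer compares by content, so it can serve as a map key for the raw bytes.
    ByteBuffer key = ByteBuffer.allocate(vector.length * Float.BYTES);
    key.asFloatBuffer().put(vector);
    Integer position = seen.get(key);
    if (position == null) {
      position = uniqueVectors.size();
      uniqueVectors.add(vector.clone());
      seen.put(key, position);
    }
    positions.add(position);
    return positions.size() - 1;
  }

  /** Search-time read: one extra hop through the ord -> position mapping. */
  float[] vectorValue(int ord) {
    return uniqueVectors.get(positions.get(ord));
  }

  public static void main(String[] args) {
    DedupVectorsSketch sketch = new DedupVectorsSketch();
    sketch.add(new float[] {1f, 2f});
    sketch.add(new float[] {3f, 4f});
    sketch.add(new float[] {1f, 2f}); // duplicate of ord 0, e.g. the same vector indexed in a second field
    System.out.println("ords=" + sketch.positions.size() + " stored=" + sketch.uniqueVectors.size());
  }
}
```

The extra `positions` lookup in `vectorValue` is the price paid at search time for storing each distinct vector only once.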
### Benchmark
In order to index everything in a single segment, I had to:
- Set [number of indexing
threads](https://github.com/mikemccand/luceneutil/blob/2b3811fec8e334366d35bffe8d9f8914f104933b/src/python/knnPerfTest.py#L246-L247)
to 1
- Increase the [writer
buffer](https://github.com/mikemccand/luceneutil/blob/2b3811fec8e334366d35bffe8d9f8914f104933b/src/main/knn/KnnIndexer.java#L59)
to be sufficiently high for all vectors
I used the `filterStrategy` option added in
https://github.com/mikemccand/luceneutil/pull/468, which (with
`index-time-filter`) creates and searches a separate KNN field containing
only a subset of the documents
Cohere vectors, 768d, `MAXIMUM_INNER_PRODUCT` similarity
`main`
```
recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)          filterStrategy  filterSelectivity  vec_disk(MB)  vec_RAM(MB)  indexType
 0.899        3.505   3.497        0.998  100000   100      50       32        250         no     6995    157.61        634.49            0.01             1          299.24   query-time-pre-filter               0.50       292.969      292.969       HNSW
 0.915        3.266   3.258        0.998  100000   100      50       32        250         no     5742      0.00      Infinity            0.11             1          299.24   query-time-pre-filter               0.20       292.969      292.969       HNSW
 0.903        2.351   2.343        0.997  100000   100      50       32        250         no     3657      0.00      Infinity            0.10             1          299.24   query-time-pre-filter               0.10       292.969      292.969       HNSW
 1.000        0.357   0.349        0.978  100000   100      50       32        250         no     1039      0.00      Infinity            0.10             1          299.24   query-time-pre-filter               0.01       292.969      292.969       HNSW
 0.498        1.185   1.178        0.994  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          299.24  query-time-post-filter               0.50       292.969      292.969       HNSW
 0.202        1.165   1.157        0.993  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          299.24  query-time-post-filter               0.20       292.969      292.969       HNSW
 0.100        1.230   1.222        0.993  100000   100      50       32        250         no     3986      0.00      Infinity            0.11             1          299.24  query-time-post-filter               0.10       292.969      292.969       HNSW
 0.010        1.181   1.173        0.993  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          299.24  query-time-post-filter               0.01       292.969      292.969       HNSW
 0.940        1.065   1.057        0.992  100000   100      50       32        250         no     3939    258.24        387.23            0.01             1          449.10       index-time-filter               0.50       292.969      292.969       HNSW
 0.961        0.913   0.906        0.992  100000   100      50       32        250         no     3568    196.23        509.62            0.01             1          359.42       index-time-filter               0.20       292.969      292.969       HNSW
 0.976        0.679   0.671        0.988  100000   100      50       32        250         no     3172    167.67        596.42            0.01             1          329.38       index-time-filter               0.10       292.969      292.969       HNSW
 1.000        0.160   0.152        0.950  100000   100      50       32        250         no      984    155.23        644.19            0.01             1          302.16       index-time-filter               0.01       292.969      292.969       HNSW
```
This PR
```
recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)          filterStrategy  filterSelectivity  vec_disk(MB)  vec_RAM(MB)  indexType
 0.904        3.705   3.697        0.998  100000   100      50       32        250         no     6982    159.17        628.28            0.01             1          300.00   query-time-pre-filter               0.50       292.969      292.969       HNSW
 0.917        3.455   3.448        0.998  100000   100      50       32        250         no     5786      0.00      Infinity            0.10             1          300.00   query-time-pre-filter               0.20       292.969      292.969       HNSW
 0.900        2.437   2.430        0.997  100000   100      50       32        250         no     3553      0.00      Infinity            0.10             1          300.00   query-time-pre-filter               0.10       292.969      292.969       HNSW
 1.000        0.366   0.358        0.978  100000   100      50       32        250         no     1023      0.00      Infinity            0.10             1          300.00   query-time-pre-filter               0.01       292.969      292.969       HNSW
 0.506        1.263   1.255        0.994  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          300.00  query-time-post-filter               0.50       292.969      292.969       HNSW
 0.206        1.255   1.247        0.994  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          300.00  query-time-post-filter               0.20       292.969      292.969       HNSW
 0.100        1.257   1.249        0.994  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          300.00  query-time-post-filter               0.10       292.969      292.969       HNSW
 0.010        1.287   1.279        0.994  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          300.00  query-time-post-filter               0.01       292.969      292.969       HNSW
 0.940        1.138   1.130        0.993  100000   100      50       32        250         no     3927    249.90        400.17            0.01             1          303.57       index-time-filter               0.50       292.969      292.969       HNSW
 0.963        1.001   0.993        0.992  100000   100      50       32        250         no     3598    188.64        530.11            0.01             1          301.41       index-time-filter               0.20       292.969      292.969       HNSW
 0.977        0.791   0.783        0.990  100000   100      50       32        250         no     3159    168.33        594.09            0.01             1          300.65       index-time-filter               0.10       292.969      292.969       HNSW
 1.000        0.209   0.201        0.962  100000   100      50       32        250         no     1023    155.47        643.22            0.01             1          300.05       index-time-filter               0.01       292.969      292.969       HNSW
```
Note the reduction in `index_size(MB)` (when `index-time-filter` is used)
due to re-use of raw vectors!
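A back-of-the-envelope check on those sizes: 100000 float32 vectors of 768 dimensions take 100000 * 768 * 4 = 307,200,000 bytes, i.e. the 292.969 MB reported as `vec_disk(MB)`. On `main`, the `index-time-filter` rows are larger than the query-time rows from the same run by 149.86, 60.18, 30.14 and 2.92 MB (selectivity 0.50 down to 0.01), while with this PR the corresponding differences are only 3.57, 1.41, 0.65 and 0.05 MB. The sketch below estimates what a second raw copy of the filtered subset would cost. It assumes the extra field holds exactly `selectivity * nDoc` vectors and ignores the second HNSW graph and the ord mapping; the class and method names are illustrative only.

```java
/** Illustrative estimate only: raw float32 storage for a subset of vectors. */
public class RawVectorSizeEstimate {
  /** Size in MB (2^20 bytes) of numVectors float32 vectors with the given dimension. */
  static double rawVectorMB(long numVectors, int dims) {
    return numVectors * dims * (double) Float.BYTES / (1024 * 1024);
  }

  public static void main(String[] args) {
    long nDoc = 100_000;
    int dims = 768;
    System.out.printf("all vectors: %.3f MB%n", rawVectorMB(nDoc, dims));
    for (double selectivity : new double[] {0.50, 0.20, 0.10, 0.01}) {
      // Assumption: the filtered field contains exactly selectivity * nDoc vectors.
      long subset = Math.round(nDoc * selectivity);
      System.out.printf(
          "selectivity %.2f: a second raw copy would add ~%.2f MB%n",
          selectivity, rawVectorMB(subset, dims));
    }
  }
}
```

If those assumptions hold, the estimates should land close to the extra size seen on `main`, which would be consistent with the duplicated raw vectors accounting for most of it.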
There is a slight increase in latency with this PR (for example, 1.065 ms ->
1.138 ms for `index-time-filter` at 0.50 selectivity), presumably because of
the extra lookup step for the vector position.