Re: [PR] De-dup raw vectors? [lucene]

via GitHub Wed, 19 Nov 2025 20:24:37 -0800


kaivalnp commented on PR #15440:
URL: https://github.com/apache/lucene/pull/15440#issuecomment-3555702820


   ### Notes
   
   - Right now this is a crude, inefficient implementation, meant only for demonstration purposes!
   - Basically a copy of `Lucene99FlatVectors*`, except that raw vectors are 
de-duped and written according to the new layout^ during `flush`
        - Additionally, an `ord -> position of vector` mapping is stored and 
used during searching
   - Does not support an index sort yet
   - Does not support merging yet
        - This is mainly an API challenge, because vector merging is expected 
to be field-by-field -- but seems doable with a new `finishMerge` API that does 
the equivalent of `flush`?
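
To make the de-dup idea concrete, here is a minimal in-memory sketch. It is not the code from this PR: the class and method names are hypothetical, "position" is simply an index into the list of unique vectors rather than a file offset, and treating two vectors as duplicates only when they are byte-wise identical is an assumption.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not the PR's code): de-dup raw vectors and keep an ord -> position mapping. */
final class DedupVectorsSketch {

  /** Unique vectors in first-seen order, plus the position of each ordinal's vector in that list. */
  static final class Layout {
    final List<float[]> uniqueVectors;
    final int[] ordToPosition;

    Layout(List<float[]> uniqueVectors, int[] ordToPosition) {
      this.uniqueVectors = uniqueVectors;
      this.ordToPosition = ordToPosition;
    }
  }

  /** Roughly what a flush-time pass could do: store each distinct vector once. */
  static Layout dedup(List<float[]> vectorsByOrd) {
    Map<ByteBuffer, Integer> seen = new HashMap<>();
    List<float[]> unique = new ArrayList<>();
    int[] ordToPosition = new int[vectorsByOrd.size()];
    for (int ord = 0; ord < vectorsByOrd.size(); ord++) {
      float[] vector = vectorsByOrd.get(ord);
      ByteBuffer key = toKey(vector);
      Integer position = seen.get(key);
      if (position == null) {
        // First time this vector is seen: append it and remember where it went.
        position = unique.size();
        unique.add(vector);
        seen.put(key, position);
      }
      ordToPosition[ord] = position;
    }
    return new Layout(unique, ordToPosition);
  }

  /** Search-time read: one extra lookup (ord -> position) before fetching the vector. */
  static float[] vectorValue(Layout layout, int ord) {
    return layout.uniqueVectors.get(layout.ordToPosition[ord]);
  }

  /** ByteBuffer compares by content, so it can serve as a byte-wise equality key for a float[]. */
  private static ByteBuffer toKey(float[] vector) {
    ByteBuffer buffer = ByteBuffer.allocate(vector.length * Float.BYTES);
    for (float v : vector) {
      buffer.putFloat(v);
    }
    buffer.flip();
    return buffer;
  }
}
```

With three ordinals where the first and third hold the same vector, `dedup` would keep two vectors and map ordinals 0 and 2 to the same position.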
   
   ### Benchmark
   
   In order to index everything in a single segment, I had to:
   - Set [number of indexing threads](https://github.com/mikemccand/luceneutil/blob/2b3811fec8e334366d35bffe8d9f8914f104933b/src/python/knnPerfTest.py#L246-L247) to 1
   - Increase the [writer buffer](https://github.com/mikemccand/luceneutil/blob/2b3811fec8e334366d35bffe8d9f8914f104933b/src/main/knn/KnnIndexer.java#L59) to be sufficiently high for all vectors
   
   I used the `filterStrategy` option added in https://github.com/mikemccand/luceneutil/pull/468 -- with `index-time-filter`, it creates and searches a separate KNN field containing only a subset of documents
   
   Cohere vectors, 768d, `MAXIMUM_INNER_PRODUCT` similarity
   
   `main`
   
   ```
   recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)          filterStrategy  filterSelectivity  vec_disk(MB)  vec_RAM(MB)  indexType
    0.899        3.505   3.497        0.998  100000   100      50       32        250         no     6995    157.61        634.49            0.01             1          299.24   query-time-pre-filter               0.50       292.969      292.969       HNSW
    0.915        3.266   3.258        0.998  100000   100      50       32        250         no     5742      0.00      Infinity            0.11             1          299.24   query-time-pre-filter               0.20       292.969      292.969       HNSW
    0.903        2.351   2.343        0.997  100000   100      50       32        250         no     3657      0.00      Infinity            0.10             1          299.24   query-time-pre-filter               0.10       292.969      292.969       HNSW
    1.000        0.357   0.349        0.978  100000   100      50       32        250         no     1039      0.00      Infinity            0.10             1          299.24   query-time-pre-filter               0.01       292.969      292.969       HNSW
    0.498        1.185   1.178        0.994  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          299.24  query-time-post-filter               0.50       292.969      292.969       HNSW
    0.202        1.165   1.157        0.993  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          299.24  query-time-post-filter               0.20       292.969      292.969       HNSW
    0.100        1.230   1.222        0.993  100000   100      50       32        250         no     3986      0.00      Infinity            0.11             1          299.24  query-time-post-filter               0.10       292.969      292.969       HNSW
    0.010        1.181   1.173        0.993  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          299.24  query-time-post-filter               0.01       292.969      292.969       HNSW
    0.940        1.065   1.057        0.992  100000   100      50       32        250         no     3939    258.24        387.23            0.01             1          449.10       index-time-filter               0.50       292.969      292.969       HNSW
    0.961        0.913   0.906        0.992  100000   100      50       32        250         no     3568    196.23        509.62            0.01             1          359.42       index-time-filter               0.20       292.969      292.969       HNSW
    0.976        0.679   0.671        0.988  100000   100      50       32        250         no     3172    167.67        596.42            0.01             1          329.38       index-time-filter               0.10       292.969      292.969       HNSW
    1.000        0.160   0.152        0.950  100000   100      50       32        250         no      984    155.23        644.19            0.01             1          302.16       index-time-filter               0.01       292.969      292.969       HNSW
   ```
   
   
   This PR
   
   ```
   recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)          filterStrategy  filterSelectivity  vec_disk(MB)  vec_RAM(MB)  indexType
    0.904        3.705   3.697        0.998  100000   100      50       32        250         no     6982    159.17        628.28            0.01             1          300.00   query-time-pre-filter               0.50       292.969      292.969       HNSW
    0.917        3.455   3.448        0.998  100000   100      50       32        250         no     5786      0.00      Infinity            0.10             1          300.00   query-time-pre-filter               0.20       292.969      292.969       HNSW
    0.900        2.437   2.430        0.997  100000   100      50       32        250         no     3553      0.00      Infinity            0.10             1          300.00   query-time-pre-filter               0.10       292.969      292.969       HNSW
    1.000        0.366   0.358        0.978  100000   100      50       32        250         no     1023      0.00      Infinity            0.10             1          300.00   query-time-pre-filter               0.01       292.969      292.969       HNSW
    0.506        1.263   1.255        0.994  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          300.00  query-time-post-filter               0.50       292.969      292.969       HNSW
    0.206        1.255   1.247        0.994  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          300.00  query-time-post-filter               0.20       292.969      292.969       HNSW
    0.100        1.257   1.249        0.994  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          300.00  query-time-post-filter               0.10       292.969      292.969       HNSW
    0.010        1.287   1.279        0.994  100000   100      50       32        250         no     3986      0.00      Infinity            0.10             1          300.00  query-time-post-filter               0.01       292.969      292.969       HNSW
    0.940        1.138   1.130        0.993  100000   100      50       32        250         no     3927    249.90        400.17            0.01             1          303.57       index-time-filter               0.50       292.969      292.969       HNSW
    0.963        1.001   0.993        0.992  100000   100      50       32        250         no     3598    188.64        530.11            0.01             1          301.41       index-time-filter               0.20       292.969      292.969       HNSW
    0.977        0.791   0.783        0.990  100000   100      50       32        250         no     3159    168.33        594.09            0.01             1          300.65       index-time-filter               0.10       292.969      292.969       HNSW
    1.000        0.209   0.201        0.962  100000   100      50       32        250         no     1023    155.47        643.22            0.01             1          300.05       index-time-filter               0.01       292.969      292.969       HNSW
   ```
   
   Note the reduction in `index_size(MB)` when `index-time-filter` is used (for example, 449.10 -> 303.57 at selectivity 0.50), due to re-use of raw vectors!
   There is a slight increase in latency with this PR (up to about 0.2 ms in these runs), presumably because of the extra lookup step of the vector position.
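
As a rough sanity check on those sizes, the sketch below estimates how much space raw float32 vectors take. The helper name is hypothetical, and it assumes "MB" in the tables means 2^20 bytes (which reproduces the reported `vec_disk(MB)` of 292.969 for 100000 vectors of 768 dimensions). It also assumes that selectivity 0.50 means the filtered field holds about 50000 of the vectors; under that assumption a second copy would cost about 146.48 MB, close to the 149.86 MB by which `index_size(MB)` on `main` grows from 299.24 to 449.10. The remainder is presumably the extra graph and metadata for the second field.

```java
/** Hypothetical helper (not part of Lucene or luceneutil): estimates raw float32 vector storage. */
final class RawVectorSizeSketch {

  // Assumption: the benchmark tables report MB as 2^20 bytes.
  private static final double BYTES_PER_MB = 1024.0 * 1024.0;

  /** Size in MB of numVectors float32 vectors with the given number of dimensions. */
  static double rawVectorMB(long numVectors, int dims) {
    return numVectors * dims * (double) Float.BYTES / BYTES_PER_MB;
  }
}
```

For example, `rawVectorMB(100000, 768)` gives the full field's raw vector size, and `rawVectorMB(50000, 768)` gives the estimated cost of duplicating half of the vectors into a second field.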
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

