benwtrent commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2698684914
@lpld here are my luceneutil changes: https://github.com/mikemccand/luceneutil/pull/348

> What exactly do the numbers in the description of this pull request mean? When you say that the recall for Cohere 768 is 0.938, is it the absolute recall value that you got from the benchmark, or is it some sort of ratio between the quantized and non-quantized recalls?

It's recall with oversampling. So, gathering the nearest 50 vectors in order to calculate the nearest 10. In a production environment, there would be a rescore phase between the initial gathering and handing the final 10 docs to the consumer.

> Do you have any ideas about what could be the reason for such a huge recall difference in the benchmark results on different environments?

Additionally, I have found that I periodically get bitten by luceneutil trying to be helpful and storing the true nearest neighbor files. If you reindex your data, the `nn-*` files should be deleted, as the doc ids will have changed. Though, those `nn-*` files containing the true nearest neighbors are very useful when adjusting query-time parameters against the same index.

I think luceneutil can be made better here when it comes to getting repeatable recall from previously stored nearest neighbors. The tricky part is making sure whatever methodology we use works with all the different kinds of tests luceneutil runs :/

> I was also trying to do some benchmarking with other public datasets (without luceneutil), and I got a little confused about how to correctly calculate the recall. I understand that recall is a ratio between the number of correct responses and the total number of responses. The total number of responses is straightforward, but the number of correct ones is a bit confusing to me.

luceneutil uses function score queries, which I think is the best way.

```
// `queryFloats` is the raw float[] query vector
var queryVector = new ConstKnnFloatValueSource(queryFloats);
var docVectors = new FloatKnnVectorFieldSource(KNN_FIELD);
var exactQuery =
    new BooleanQuery.Builder()
        .add(
            new FunctionQuery(
                new FloatVectorSimilarityFunction(similarityFunction, queryVector, docVectors)),
            BooleanClause.Occur.SHOULD)
        .add(filterQuery, BooleanClause.Occur.FILTER)
        .build();
```

This will do a raw floating-point vector comparison against every vector that passes the `filterQuery`. That will give you the true nearest neighbors for a given query vector and configured `similarityFunction`.

> However, in Lucene unit tests a different query is used to get the correct neighbors from the index:

I think this might actually be a bug. It will indeed do a full scan, but it uses the quantized vector scorers. What you actually want is the raw vector scores compared directly.

Then you simply compare set overlap given the doc ids (or some stored id field) when queried against the index.
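For the set-overlap step, here is a minimal sketch (the class/method names and the `exactTopK`/`approxTopK` arrays are placeholders of mine, not luceneutil APIs): take the doc ids from the exact function-score query and from the approximate search, both truncated to the same top k, and score the overlap.

```
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

public final class RecallUtil {
  /**
   * Recall = |exact ∩ approx| / |exact|. With oversampling, `approxTopK`
   * would be the 10 docs kept after rescoring the 50 gathered candidates
   * with full-precision vectors, compared against the exact top 10.
   */
  public static double recall(int[] exactTopK, int[] approxTopK) {
    Set<Integer> truth = Arrays.stream(exactTopK).boxed().collect(Collectors.toSet());
    long hits = Arrays.stream(approxTopK).filter(truth::contains).count();
    return (double) hits / exactTopK.length;
  }
}
```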