benwtrent commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2698684914
@lpld here are my luceneutil changes: https://github.com/mikemccand/luceneutil/pull/348

> What exactly do the numbers in the description of this pull request mean? When you say that the recall for Cohere 768 is 0.938, is it the absolute recall value that you got from the benchmark, or is it some sort of ratio between the quantized and non-quantized recalls?

It's recall with oversampling. So, gathering the nearest 50 vectors in order to calculate the nearest 10. In a production environment, there would be a rescore phase between the initial gathering and handing the final 10 docs to the consumer.

> Do you have any ideas about what could be the reason for such a huge recall difference in the benchmark results on different environments?

Additionally, I have found that I periodically get bitten by luceneutil trying to be helpful and storing the true nearest neighbor files. If you reindex your data, the `nn-*` files should be deleted, as the doc ids will have changed. Though, those `nn-*` files containing the true nearest neighbors are very useful when adjusting query-time parameters against the same index.

I think luceneutil can be made better here when it comes to getting repeatable recall from previously stored nearest neighbors. The tricky part is making sure whatever methodology we use works with all the different kinds of tests luceneutil runs :/

> I was also trying to do some benchmarking with other public datasets (without luceneutil), and I got a little confused about how to correctly calculate the recall. I understand that recall is a ratio between the number of correct responses and the total number of responses. The total number of responses is straightforward, but the number of correct ones is a bit confusing to me.

luceneutil uses function score queries, which I think is the best way.

```
// `queryFloats` is the raw float[] query vector
var queryVector = new ConstKnnFloatValueSource(queryFloats);
var docVectors = new FloatKnnVectorFieldSource(KNN_FIELD);
var exactQuery =
    new BooleanQuery.Builder()
        .add(
            new FunctionQuery(
                new FloatVectorSimilarityFunction(similarityFunction, queryVector, docVectors)),
            BooleanClause.Occur.SHOULD)
        .add(filterQuery, BooleanClause.Occur.FILTER)
        .build();
```

This will do a raw floating-point vector comparison against every vector that passes the `filterQuery`. That will give you the true nearest neighbors for a given query vector and configured `similarityFunction`.

> However, in Lucene unit tests a different query is used to get the correct neighbors from the index:

I think this might actually be a bug. It will indeed do a full scan, but it uses the quantized vector scorers. What you actually want is the raw vector scores compared directly.

Then you simply compare set overlap given the doc ids (or some stored id field) when queried against the index.
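For the set-overlap step, here is a minimal sketch (the class/method names and the `exactTopK`/`approxTopK` arrays are placeholders of mine, not luceneutil APIs): take the doc ids from the exact function-score query and from the approximate search, both truncated to the same top k, and score the overlap.

```
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

public final class RecallUtil {
  /**
   * Recall = |exact ∩ approx| / |exact|. With oversampling, `approxTopK`
   * would be the 10 docs kept after rescoring the 50 gathered candidates
   * with full-precision vectors, compared against the exact top 10.
   */
  public static double recall(int[] exactTopK, int[] approxTopK) {
    Set<Integer> truth = Arrays.stream(exactTopK).boxed().collect(Collectors.toSet());
    long hits = Arrays.stream(approxTopK).filter(truth::contains).count();
    return (double) hits / exactTopK.length;
  }
}
```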