benwtrent commented on code in PR #11946:
URL: https://github.com/apache/lucene/pull/11946#discussion_r1049904819
########## lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -76,12 +91,29 @@ public KnnVectorQuery(String field, float[] target, int k) {
    * @throws IllegalArgumentException if <code>k</code> is less than 1
    */
   public KnnVectorQuery(String field, float[] target, int k, Query filter) {
+    this(field, target, k, Float.NEGATIVE_INFINITY, filter);
+  }
+
+  /**
+   * Find the <code>k</code> nearest documents to the target vector according to the vectors in the
+   * given field. <code>target</code> vector.
+   *
+   * @param field a field that has been indexed as a {@link KnnVectorField}.
+   * @param target the target of the search
+   * @param k the number of documents to find (the upper bound)
+   * @param similarityThreshold the minimum acceptable value of similarity

Review Comment:

Tl;dr: Thank you for bearing with me! I think this is a good change. I would be happy with the JavaDocs, etc. clearly indicating that this threshold relates to the un-boosted vector score, not the raw similarity calculation. Dot-product, cosine, and Euclidean distance are well-defined concepts outside of Lucene. Lucene mangles (for undoubtedly good reasons) the output of these similarities in undocumented ways to fit within boundaries.

> with the current CR,

I don't know what `CR` means. Change request?

> However, this is similar to how result scores are treated elsewhere in Lucene - their value ranges are not well-defined;

Agreed, ranges are usually predicated on term statistics, etc., and can potentially be considered "unbounded" as the corpus changes. However, does Lucene require that all unboosted BM25 scores are between 0 and 1? It does seem like an "arbitrary" decision (to me; I don't know the full breadth of Lucene optimizations, etc., when it comes to scores) to restrict vector similarity in this way. But that is a broader conversation. I have some learning to do.

> I guess practically speaking, as a user, I think I am going to have to do empirical work to know what threshold to use; these are not likely going to be motivated by some a priori knowledge of what a "good" dot-product is

I would argue that a user could have a priori knowledge here. Think of the case where the user knows the model used to produce the vectors. At that point, they know exactly what is considered relevant based on their loss function and training + test data, and could choose a dot-product or cosine threshold that, say, the 90th percentile of their test-data results would pass. I agree that this would be different if users were using an "off the shelf" model. In that case, they would probably require hybrid search, combining with BM25 (and boosting the various queries accordingly) to get anything like relevant results, thus learning what settings are required in an unfiltered case.

> if we were to switch to using vector similarities that would correspond more directly to the underlying functions, we would have to clearly define them

Cosine, dot-product, and Euclidean distance are all already well defined. The functions to calculate them are universally recognized. Where Lucene separates itself is the manipulation of the similarity output to fit into the range [0, 1]. I guess this is the cost of doing business in Lucene. I am not suggesting that all scoring of vector document searches should change, simply that "similarity" and "score" are related but different things.
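Not part of the PR, but to make the "similarity" vs. "score" distinction above concrete, here is a minimal sketch contrasting a directly computed cosine with the value `VectorSimilarityFunction.COSINE.compare` produces. The `(1 + cosine) / 2` rescaling noted in the comments is my assumption about the current implementation; verify it against the Lucene version in use.

```java
import org.apache.lucene.index.VectorSimilarityFunction;

public class SimilarityVsScore {
  public static void main(String[] args) {
    float[] target = {0.6f, 0.8f};
    float[] candidate = {1.0f, 0.0f};

    // Raw cosine similarity, computed directly; well defined outside of Lucene.
    float rawCosine = cosine(target, candidate); // 0.6

    // The value Lucene uses for ranking: the raw similarity rescaled into [0, 1]
    // (roughly (1 + 0.6) / 2 = 0.8 under the scaling assumed above).
    float luceneScore = VectorSimilarityFunction.COSINE.compare(target, candidate);

    System.out.printf("raw cosine = %.3f, Lucene score = %.3f%n", rawCosine, luceneScore);
    // A similarityThreshold expressed against one of these values does not mean
    // the same thing when applied to the other.
  }

  // Plain textbook cosine similarity, for comparison with Lucene's rescaled score.
  private static float cosine(float[] a, float[] b) {
    float dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return (float) (dot / (Math.sqrt(normA) * Math.sqrt(normB)));
  }
}
```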
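And a hypothetical sketch of the "a priori threshold" idea: derive a raw-similarity cutoff from the model's own test pairs, then rescale it if the query's threshold is interpreted in Lucene's score space. The class, the data, and the 90% figure are all illustrative, not anything defined by Lucene.

```java
import java.util.Arrays;

public class ThresholdFromTestData {
  /** Returns the raw-similarity value at the given percentile (0-100) of the data. */
  static float percentile(float[] rawSimilarities, double pct) {
    float[] sorted = rawSimilarities.clone();
    Arrays.sort(sorted);
    int idx = (int) Math.ceil(pct / 100.0 * sorted.length) - 1;
    return sorted[Math.max(0, Math.min(idx, sorted.length - 1))];
  }

  public static void main(String[] args) {
    // Raw cosine similarities of known-relevant pairs from the model's own test set.
    float[] relevantPairSimilarities = {
      0.41f, 0.55f, 0.62f, 0.68f, 0.71f, 0.74f, 0.79f, 0.83f, 0.88f, 0.93f
    };
    // Cutoff that ~90% of the relevant test pairs would pass.
    float rawCutoff = percentile(relevantPairSimilarities, 10);

    // If the threshold is interpreted in Lucene's score space rather than as the raw
    // cosine, it would need the same rescaling Lucene applies to cosine (assumed here
    // to be (1 + cosine) / 2 -- verify against VectorSimilarityFunction).
    float scoreCutoff = (1 + rawCutoff) / 2;
    System.out.printf("raw cosine cutoff = %.2f, score-space cutoff = %.2f%n", rawCutoff, scoreCutoff);
  }
}
```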