[ https://issues.apache.org/jira/browse/LUCENE-10147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424680#comment-17424680 ]
Michael Sokolov commented on LUCENE-10147:
------------------------------------------

Also, I'll just note that we do have this javadoc about the requirement for unit-length vectors:

{{
  /**
   * Creates a numeric vector field. Fields are single-valued: each document has either one value or
   * no value. Vectors of a single field share the same dimension and similarity function. Note that
   * some strategies (notably dot-product) require values to be unit-length, which can be enforced
   * using VectorUtil.l2Normalize(float[]).
   *
   * @param name field name
   * @param vector value
   * @param similarityFunction a function defining vector proximity.
   * @throws IllegalArgumentException if any parameter is null, or the vector is empty or has
   *     dimension > 1024.
   */
  public KnnVectorField(String name, float[] vector, VectorSimilarityFunction similarityFunction) {
}}

I guess it is pretty oblique though; I think we need more specific guidance, and something in the VectorSimilarity class too - does that make sense for docs?

> KnnVectorQuery can produce negative scores
> ------------------------------------------
>
>                 Key: LUCENE-10147
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10147
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Julie Tibshirani
>            Priority: Blocker
>
> The cosine similarity of two vectors falls in the range [-1, 1]. So currently
> with cosine similarity, {{KnnVectorQuery}} can produce negative scores. Maybe
> we should just adjust the scores in this case by adding 1, shifting them to
> the range [0, 2].
> As a side note, this made me notice that
> {{VectorSimilarityFunction.DOT_PRODUCT}} is really quite "expert"! Users need
> to know to normalize all document and query vectors to unit length when using
> this similarity. Otherwise the output is unbounded and difficult to handle in
> scoring. Also dot product is not a true metric: for example, it doesn't obey
> the triangle inequality. So many ANN algorithms have trouble supporting it.
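
To make the two points above concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class {{UnitVectorSketch}} and its methods are hypothetical stand-ins, and {{l2Normalize}} here returns a copy rather than reproducing the behavior of the {{VectorUtil.l2Normalize(float[])}} method named in the javadoc. It only illustrates that the dot product of unit-length vectors equals their cosine and so stays in [-1, 1], and that adding 1 shifts such a score into [0, 2] as the issue proposes.

```java
// Illustrative sketch only; names are hypothetical and not part of Lucene.
final class UnitVectorSketch {

  /** Returns a copy of v scaled to unit L2 length. */
  static float[] l2Normalize(float[] v) {
    double sumSquares = 0.0;
    for (float x : v) {
      sumSquares += (double) x * x;
    }
    if (sumSquares == 0.0) {
      throw new IllegalArgumentException("cannot normalize a zero-length vector");
    }
    float norm = (float) Math.sqrt(sumSquares);
    float[] out = new float[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = v[i] / norm;
    }
    return out;
  }

  /** Plain dot product; unbounded unless both inputs are unit-length. */
  static float dotProduct(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("dimension mismatch");
    }
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Shifts a cosine value in [-1, 1] into [0, 2] by adding 1. */
  static float shiftedScore(float cosine) {
    return 1f + cosine;
  }
}
```

For example, the raw dot product of (3, 4) with itself is 25, while after normalization it is approximately 1; two opposite unit vectors give a dot product of -1, which the shift maps to 0.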
> As part of this issue, we could improve the documentation on
> {{VectorSimilarityFunction.DOT_PRODUCT}} to clarify that normalization is
> required.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)