benwtrent commented on code in PR #13308: URL: https://github.com/apache/lucene/pull/13308#discussion_r1575385344
########## lucene/backward-codecs/src/test/org/apache/lucene/backward_index/TestInt8HnswBackwardsCompatibility.java: ########## @@ -50,8 +51,14 @@ public class TestInt8HnswBackwardsCompatibility extends BackwardsCompatibilityTe private static final String KNN_VECTOR_FIELD = "knn_field"; private static final int DOC_COUNT = 30; private static final FieldType KNN_VECTOR_FIELD_TYPE = - KnnFloatVectorField.createFieldType(3, VectorSimilarityFunction.COSINE); + KnnFloatVectorField.createFieldType(3, VectorSimilarityFunction.DOT_PRODUCT); private static final float[] KNN_VECTOR = {0.2f, -0.1f, 0.1f}; + private static final float[] NORMALIZED_KNN_VECTOR = new float[KNN_VECTOR.length]; Review Comment: Same sort of comment as above on versioning and that we index fields into an old index. ########## lucene/backward-codecs/src/test/org/apache/lucene/backward_index/TestBasicBackwardsCompatibility.java: ########## @@ -98,8 +99,14 @@ public class TestBasicBackwardsCompatibility extends BackwardsCompatibilityTestB private static final int KNN_VECTOR_MIN_SUPPORTED_VERSION = LUCENE_9_0_0.major; private static final String KNN_VECTOR_FIELD = "knn_field"; private static final FieldType KNN_VECTOR_FIELD_TYPE = - KnnFloatVectorField.createFieldType(3, VectorSimilarityFunction.COSINE); + KnnFloatVectorField.createFieldType(3, VectorSimilarityFunction.DOT_PRODUCT); Review Comment: I am not 100% sure this will work as simply as this. We will index NEW fields into an index created with the OLD code (see where we call `addDoc`), meaning you will be trying to index a `dot_product` knn field into an index with a `cosine` field. You will have to handle the index versions from before a certain version (thus on cosine) to ones created after (and thus dot-product). ########## lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java: ########## @@ -61,24 +60,6 @@ public float compare(byte[] v1, byte[] v2) { } }, - /** - * Cosine similarity. NOTE: the preferred way to perform cosine similarity is to normalize all - * vectors to unit length, and instead use {@link VectorSimilarityFunction#DOT_PRODUCT}. You - * should only use this function if you need to preserve the original vectors and cannot normalize - * them in advance. The similarity score is normalised to assure it is positive. - */ - COSINE { - @Override - public float compare(float[] v1, float[] v2) { - return Math.max((1 + cosine(v1, v2)) / 2, 0); - } - - @Override - public float compare(byte[] v1, byte[] v2) { - return (1 + cosine(v1, v2)) / 2; - } - }, - Review Comment: I think this will get tricky. If you remove this ordinal (I think it would be wise to deprecate first, work through the warnings, etc. get everything passing, etc.), field info & knn vector format readers will need to somehow distinguish between "Max inner product" (whose ordinal decreases to cosine's) and "cosine but in an old index version". Maybe on write, we adjust the ordinal dynamically, knowing what they are (thus correcting MIP's ordinal on write, allowing for readers to distinguish). ########## lucene/core/src/java/org/apache/lucene/util/quantization/ScalarQuantizedRandomVectorScorer.java: ########## @@ -30,23 +28,6 @@ public class ScalarQuantizedRandomVectorScorer extends RandomVectorScorer.AbstractRandomVectorScorer<byte[]> { - public static float quantizeQuery( Review Comment: Even in Lucene 10, we will need to support quantizing a query sent to a Lucene index that uses Cosine. ########## lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java: ########## @@ -61,24 +60,6 @@ public float compare(byte[] v1, byte[] v2) { } }, - /** - * Cosine similarity. NOTE: the preferred way to perform cosine similarity is to normalize all Review Comment: I think it would be good to deprecate this first, get all the internal paths updated to not use it. Then the deprecation can be merged and backported. Then work on removing it from main. Mostly because the deprecation path and internal code usage will all have to occur anyways :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org