john-wagster commented on code in PR #13651:
URL: https://github.com/apache/lucene/pull/13651#discussion_r1742666476
##########
lucene/core/src/java/org/apache/lucene/codecs/lucene912/BinarizedByteVectorValues.java:
##########
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene912;
+
+import java.io.IOException;
+import org.apache.lucene.index.ByteVectorValues;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.VectorScorer;
+
+/**
+ * A version of {@link ByteVectorValues}, but additionally retrieving score correction values offset
+ * for binarization quantization scores.
+ *
+ * @lucene.experimental
+ */
+public abstract class BinarizedByteVectorValues extends DocIdSetIterator {
+  public abstract float getDistanceToCentroid() throws IOException;
+
+  /** Returns the cluster ID for the vector in the range [0, 255] */
+  public abstract short clusterId() throws IOException;
+
+  public static byte encodeClusterIdToByte(short clusterId) {
+    byte bClusterId = clusterId <= 127 ? (byte) clusterId : (byte) (clusterId - 256);
+    return bClusterId;
+  }
+
+  public static short decodeClusterIdFromByte(byte bClusterId) {
+    short clusterId = bClusterId >= 0 ? (short) bClusterId : (short) (bClusterId + 256);
+    return clusterId;
+  }
+
+  public abstract float getMagnitude() throws IOException;
+
+  public abstract float getOOQ() throws IOException;
+
+  public abstract float getNormOC() throws IOException;
+
+  public abstract float getODotC() throws IOException;

Review Comment:
I was naively following the existing pattern to get things working. This class had getMagnitude and getDistanceToCentroid in it before I added getOOQ, getNormOC, and getODotC. For MIP only OOQ, NormOC, and ODotC are needed, and the other values are essentially ignored (they either overlap with the MIP values or are 0f). So I don't think it blows up as is, but it is storing one extra, unnecessary float per vector for Euclidean now. I could be missing something, though.

Two thoughts I had but hadn't quite gotten to yet. First, I agree with you that this should deal with arbitrary float (or byte) values. Where I think we had last landed was some way to decode those values on both read and write: the Scorer (or the Format? via reflection, to avoid versioning problems?) would supply to the Reader and Writer how to serialize and deserialize a set of bytes appropriately for the similarity function, which would eliminate some overhead (1 byte in the Euclidean case). I would buy updating these interfaces to deal with an array of floats in the meantime, if quantizing these corrective factors is not a high priority; right now I have been operating under the assumption that we do want to refactor toward storing minimal overhead for each vector (2 bytes per vector in the "Euclidean" case and 3 bytes per vector in the "MIP" case).
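To make that first thought a bit more concrete, here is a rough sketch of what I mean by letting the format decide, per similarity function, how many corrective values get stored. The class and method names are hypothetical and not part of this PR; only VectorSimilarityFunction is real Lucene API.

```java
import org.apache.lucene.index.VectorSimilarityFunction;

// Hypothetical sketch only: the format/scorer tells the reader and writer how many
// corrective factors a given similarity function needs, so we store exactly that many
// quantized values per vector instead of a fixed superset.
final class CorrectiveFactorLayout {
  static int correctiveFactorCount(VectorSimilarityFunction similarity) {
    if (similarity == VectorSimilarityFunction.MAXIMUM_INNER_PRODUCT) {
      return 3; // OOQ, normOC, oDotC
    }
    // distance to centroid + magnitude; assumption: non-MIP similarities share the Euclidean layout
    return 2;
  }
}
```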
Second thought: I think we want to swap all of these values out for 1-byte quantized forms. I haven't started that work yet, but it seems pretty straightforward in my head, which is probably why I'm seeing it as part of this first pass. I was thinking to do both things together and refactor this to return a single getCorrectiveBytes instead of these individual corrective factors, putting the burden of decoding them on the caller. There is a separate place, specifically for query time (not indexing), where instead of pulling through a set of corrective float[] factors we now pull through a QueryFactors object; that may be worth discussing further (though it is a different place in the code). I did that because we could optimize one of the floats to a short, and in the belief that we could / would want to optimize the other values as well (it also greatly improves readability). Subsequent conversations with the team suggest we don't care as much about quantizing those query-side values, though, because the query quantization that gets serialized only goes into a temporary file. I digress.
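For what it's worth, the kind of 1-byte quantization I have in mind for the corrective factors is roughly the sketch below. Everything here is hypothetical (the class name, the min/scale layout, and where those statistics would live); it's only meant to show the shape of the getCorrectiveBytes idea, not the actual serialization.

```java
// Hypothetical sketch of packing a corrective factor into a single byte via scalar
// quantization against some per-field or per-segment min/scale; not code in this PR.
final class CorrectiveByteCodec {
  // Maps a value in [min, min + 255 * scale] onto [0, 255], stored as a signed Java byte.
  static byte quantize(float value, float min, float scale) {
    int bucket = Math.round((value - min) / scale);
    bucket = Math.max(0, Math.min(255, bucket));
    return (byte) bucket;
  }

  // Inverse of quantize, up to one scale step of quantization error.
  static float dequantize(byte quantized, float min, float scale) {
    return min + (quantized & 0xFF) * scale;
  }
}
```

The getCorrectiveBytes approach would then just concatenate the two or three quantized factors per vector and leave dequantization to the caller.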