john-wagster commented on code in PR #13651:
URL: https://github.com/apache/lucene/pull/13651#discussion_r1742666476
##########
lucene/core/src/java/org/apache/lucene/codecs/lucene912/BinarizedByteVectorValues.java:
##########
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene912;
+
+import java.io.IOException;
+import org.apache.lucene.index.ByteVectorValues;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.VectorScorer;
+
+/**
+ * A version of {@link ByteVectorValues}, but additionally retrieving score correction values offset
+ * for binarization quantization scores.
+ *
+ * @lucene.experimental
+ */
+public abstract class BinarizedByteVectorValues extends DocIdSetIterator {
+  public abstract float getDistanceToCentroid() throws IOException;
+
+  /** Returns the cluster ID for the vector in the range [0, 255] */
+  public abstract short clusterId() throws IOException;
+
+  public static byte encodeClusterIdToByte(short clusterId) {
+    byte bClusterId = clusterId <= 127 ? (byte) clusterId : (byte) (clusterId - 256);
+    return bClusterId;
+  }
+
+  public static short decodeClusterIdFromByte(byte bClusterId) {
+    short clusterId = bClusterId >= 0 ? (short) bClusterId : (short) (bClusterId + 256);
+    return clusterId;
+  }
+
+  public abstract float getMagnitude() throws IOException;
+
+  public abstract float getOOQ() throws IOException;
+
+  public abstract float getNormOC() throws IOException;
+
+  public abstract float getODotC() throws IOException;

Review Comment:
I was naively following the existing pattern to get things working. This class had getMagnitude and getDistanceToCentroid in it before I added getOOQ, getNormOC, and getODotC. For MIP only OOQ, NormOC, and ODotC are needed, and the other values are essentially ignored (they either overlap with the MIP values or are 0f). So I don't think it blows up as is, but it is storing one extra, unnecessary float per vector for Euclidean now. I could be missing something, though.

Two thoughts I had but hadn't quite gotten to yet. First, I agree with you that this should deal with arbitrary float (or byte) values. Where I think we had last landed was some way to decode those values on both read and write: the Scorer (or the Format? via reflection, to avoid versioning problems?) would supply to the Reader and Writer how to serialize and deserialize a set of bytes appropriately for the similarity function, which would eliminate some overhead (1 byte in the Euclidean case). I would buy updating these interfaces to deal with an array of floats in the meantime, if quantizing these corrective factors is not a high priority; right now I have been operating under the assumption that we do want to refactor toward storing minimal overhead for each vector (2 bytes per vector in the "Euclidean" case and 3 bytes per vector in the "MIP" case).
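To make that first thought a bit more concrete, here is a rough sketch of what I mean by letting the format decide, per similarity function, how many corrective values get stored. The class and method names are hypothetical and not part of this PR; only VectorSimilarityFunction is real Lucene API.

```java
import org.apache.lucene.index.VectorSimilarityFunction;

// Hypothetical sketch only: the format/scorer tells the reader and writer how many
// corrective factors a given similarity function needs, so we store exactly that many
// quantized values per vector instead of a fixed superset.
final class CorrectiveFactorLayout {
  static int correctiveFactorCount(VectorSimilarityFunction similarity) {
    if (similarity == VectorSimilarityFunction.MAXIMUM_INNER_PRODUCT) {
      return 3; // OOQ, normOC, oDotC
    }
    // distance to centroid + magnitude; assumption: non-MIP similarities share the Euclidean layout
    return 2;
  }
}
```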
Second thought: I think we want to swap all of these values out for 1-byte quantized forms. I haven't started that work yet, but it seems pretty straightforward in my head, which is probably why I'm seeing it as part of this first pass. I was thinking to do both things together and refactor this to return a single getCorrectiveBytes instead of these individual corrective factors, putting the burden of decoding them on the caller. There is a separate place, specifically for query time (not indexing), where instead of pulling through a set of corrective float[] factors we now pull through a QueryFactors object; that may be worth discussing further (though it is a different place in the code). I did that because we could optimize one of the floats to a short, and in the belief that we could / would want to optimize the other values as well (it also greatly improves readability). Subsequent conversations with the team suggest we don't care as much about quantizing those query-side values, though, because the query quantization that gets serialized only goes into a temporary file. I digress.
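For what it's worth, the kind of 1-byte quantization I have in mind for the corrective factors is roughly the sketch below. Everything here is hypothetical (the class name, the min/scale layout, and where those statistics would live); it's only meant to show the shape of the getCorrectiveBytes idea, not the actual serialization.

```java
// Hypothetical sketch of packing a corrective factor into a single byte via scalar
// quantization against some per-field or per-segment min/scale; not code in this PR.
final class CorrectiveByteCodec {
  // Maps a value in [min, min + 255 * scale] onto [0, 255], stored as a signed Java byte.
  static byte quantize(float value, float min, float scale) {
    int bucket = Math.round((value - min) / scale);
    bucket = Math.max(0, Math.min(255, bucket));
    return (byte) bucket;
  }

  // Inverse of quantize, up to one scale step of quantization error.
  static float dequantize(byte quantized, float min, float scale) {
    return min + (quantized & 0xFF) * scale;
  }
}
```

The getCorrectiveBytes approach would then just concatenate the two or three quantized factors per vector and leave dequantization to the caller.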