Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]

via GitHub Wed, 30 Oct 2024 12:23:32 -0700


mayya-sharipova commented on code in PR #13651:
URL: https://github.com/apache/lucene/pull/13651#discussion_r1823251497



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101BinaryQuantizedVectorsFormat.java:
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene101;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.hnsw.FlatVectorScorerUtil;
+import org.apache.lucene.codecs.hnsw.FlatVectorsFormat;
+import org.apache.lucene.codecs.hnsw.FlatVectorsReader;
+import org.apache.lucene.codecs.hnsw.FlatVectorsWriter;
+import org.apache.lucene.codecs.lucene99.Lucene99FlatVectorsFormat;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+
+/**
+ * Codec for encoding/decoding binary quantized vectors The binary quantization format used here
+ * reflects <a href="https://arxiv.org/abs/2405.12497">RaBitQ</a>. Also see {@link
+ * org.apache.lucene.util.quantization.BinaryQuantizer}. Some of key features of RabitQ are:
+ *
+ * <ul>
+ *   <li>Estimating the distance between two vectors using their centroid normalized distance. This
+ *       requires some additional corrective factors, but allows for centroid normalization to occur
+ *       and thus enabling binary quantization.
+ *   <li>Binary quantization of centroid normalized vectors.
+ *   <li>Asymmetric quantization of vectors, where query vectors are quantized to half-byte
+ *       precision (normalized to the centroid) and then compared directly against the single bit
+ *       quantized vectors in the index.
+ *   <li>Transforming the half-byte quantized query vectors in such a way that the comparison with
+ *       single bit vectors can be done with bit arithmetic.
+ *   <li>Utilizing an error bias calculation enabled by the centroid normalization. This allows for
+ *       dynamic rescoring of vectors that fall outside a certain error threshold.
+ * </ul>
+ *
+ * The format is stored in two files:
+ *
+ * <h2>.veb (vector data) file</h2>
+ *
+ * <p>Stores the binary quantized vectors in a flat format. Additionally, it stores each vector's
+ * corrective factors. At the end of the file, additional information is stored for vector ordinal
+ * to centroid ordinal mapping and sparse vector information.
+ *
+ * <ul>
+ *   <li>For each vector:
+ *       <ul>
+ *         <li><b>[byte]</b> the binary quantized values, each byte holds 8 bits.
+ *         <li><b>[float]</b> the corrective values. Two floats for Euclidean distance. Three floats
+ *             for the dot-product family of distances.
+ *       </ul>
+ *   <li>After the vectors, sparse vector information keeping track of monotonic blocks.
+ * </ul>
+ *
+ * <h2>.vemb (vector metadata) file</h2>
+ *
+ * <p>Stores the metadata for the vectors. This includes the number of vectors, the number of
+ * dimensions, and file offset information.
+ *
+ * <ul>
+ *   <li><b>int</b> the field number
+ *   <li><b>int</b> the vector encoding ordinal
+ *   <li><b>int</b> the vector similarity ordinal
+ *   <li><b>vint</b> the vector dimensions
+ *   <li><b>vlong</b> the offset to the vector data in the .veb file
+ *   <li><b>vlong</b> the length of the vector data in the .veb file
+ *   <li><b>vint</b> the number of vectors

Review Comment:
   Also:
   
    <li><b>[float]</b> clusterCenter
    <li><b>int</b> dotProduct of clusterCenter with itself
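
   For illustration only, here is a minimal sketch of what those two extra metadata entries could look like on disk. It is not the PR's writer and uses no Lucene API: the class `CentroidMetaSketch`, its method names, the big-endian `ByteBuffer` layout, and the MSB-first bit order are all assumptions made for the example. It reads the `int` entry as the raw float bits of the centroid's dot product with itself, which is one plausible interpretation, and it also shows the "each byte holds 8 bits" packing of centroid-normalized vectors described in the javadoc.

   ```java
   import java.nio.ByteBuffer;

   /** Hypothetical sketch; names and byte layout are assumptions, not the PR's code. */
   public class CentroidMetaSketch {

     /** Dot product of the centroid with itself, i.e. its squared L2 norm. */
     static float selfDotProduct(float[] centroid) {
       float sum = 0f;
       for (float v : centroid) {
         sum += v * v;
       }
       return sum;
     }

     /**
      * Packs one bit per dimension, 8 dimensions per byte. A bit is set when the
      * vector component is above the centroid component. MSB-first order is an
      * assumption of this sketch.
      */
     static byte[] packSignBits(float[] vector, float[] centroid) {
       byte[] packed = new byte[(vector.length + 7) / 8];
       for (int i = 0; i < vector.length; i++) {
         if (vector[i] - centroid[i] > 0f) {
           packed[i / 8] |= (byte) (1 << (7 - (i % 8)));
         }
       }
       return packed;
     }

     /** Serializes [float] clusterCenter followed by an int holding the float bits of c.c. */
     static byte[] writeCentroidMeta(float[] centroid) {
       ByteBuffer buf = ByteBuffer.allocate(4 * centroid.length + 4);
       for (float v : centroid) {
         buf.putFloat(v);
       }
       buf.putInt(Float.floatToIntBits(selfDotProduct(centroid)));
       return buf.array();
     }

     /** Reads the centroid back; dims must match the value stored as "vector dimensions". */
     static float[] readCentroid(byte[] meta, int dims) {
       ByteBuffer buf = ByteBuffer.wrap(meta);
       float[] centroid = new float[dims];
       for (int i = 0; i < dims; i++) {
         centroid[i] = buf.getFloat();
       }
       return centroid;
     }

     /** Reads the trailing int and reinterprets it as the float dot product. */
     static float readSelfDotProduct(byte[] meta, int dims) {
       return Float.intBitsToFloat(ByteBuffer.wrap(meta).getInt(4 * dims));
     }
   }
   ```

   Storing the self dot product next to the centroid would let a reader avoid recomputing it each time the field is opened. Whether the real format stores it as raw float bits is for the PR to confirm.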




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

