mccullocht commented on code in PR #15271:
URL: https://github.com/apache/lucene/pull/15271#discussion_r2395234917


##########
lucene/core/src/java/org/apache/lucene/codecs/lucene104/Lucene104ScalarQuantizedVectorsFormat.java:
##########
@@ -186,16 +203,35 @@ public Lucene104ScalarQuantizedVectorsFormat() {
     this(ScalarEncoding.UNSIGNED_BYTE);
   }
 
-  /** Creates a new instance with the chosen encoding. */
+  /** Creates a new instance with the chosen symmetric quantization encoding. */
   public Lucene104ScalarQuantizedVectorsFormat(ScalarEncoding encoding) {
+    this(encoding, encoding);
+  }
+
+  /** Creates a new instance with the chosen asymmetric quantization encoding. */
+  public Lucene104ScalarQuantizedVectorsFormat(
+      ScalarEncoding encoding, ScalarEncoding queryEncoding) {
     super(NAME);
     this.encoding = encoding;
+    this.queryEncoding = queryEncoding;
+    // until we have optimized scorers for various other asymmetric encodings, maybe we only allow 1

Review Comment:
   Should ScalarEncoding enumerate both the query and index vector encodings? That would turn this runtime error into a compile-time error.
   
   Using `PACKED_NIBBLE` for the asymmetric 4:1 case is a bit weird, because it's packed in a very different way.
   
   FWIW, I agree about allowing other combinations; in particular, I think 8 -> N for all values of N we want to support is a good idea. I know that for 4:1 we write two vector sets during index building to improve scores/graph quality. I'm not sure that's necessary in other cases, but I'm also not sure how much it would hurt.
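   For illustration, a sketch of the paired-enum idea (the type and constant names here are made up):
   
   ```java
   // Hypothetical: one enum value per supported (index, query) encoding pair,
   // so an unsupported combination can't be constructed in the first place.
   public enum ScalarEncodingPair {
     UNSIGNED_BYTE(ScalarEncoding.UNSIGNED_BYTE, ScalarEncoding.UNSIGNED_BYTE),
     SEVEN_BIT(ScalarEncoding.SEVEN_BIT, ScalarEncoding.SEVEN_BIT),
     PACKED_NIBBLE(ScalarEncoding.PACKED_NIBBLE, ScalarEncoding.PACKED_NIBBLE),
     SINGLE_BIT_WITH_NIBBLE_QUERIES(ScalarEncoding.SINGLE_BIT, ScalarEncoding.PACKED_NIBBLE);
   
     private final ScalarEncoding indexEncoding;
     private final ScalarEncoding queryEncoding;
   
     ScalarEncodingPair(ScalarEncoding indexEncoding, ScalarEncoding queryEncoding) {
       this.indexEncoding = indexEncoding;
       this.queryEncoding = queryEncoding;
     }
   
     public ScalarEncoding indexEncoding() {
       return indexEncoding;
     }
   
     public ScalarEncoding queryEncoding() {
       return queryEncoding;
     }
   }
   ```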



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene104/Lucene104ScalarQuantizedVectorScorer.java:
##########
@@ -131,14 +205,17 @@ public float score(int node) throws IOException {
         public void setScoringOrdinal(int node) throws IOException {
           var rawTargetVector = targetValues.vectorValue(node);
           switch (values.getScalarEncoding()) {
-            case UNSIGNED_BYTE -> targetVector = rawTargetVector;
-            case SEVEN_BIT -> targetVector = rawTargetVector;
+            case UNSIGNED_BYTE, SEVEN_BIT -> targetVector = rawTargetVector;

Review Comment:
   Can this class be folded into the asymmetric implementation and removed? I think `VectorUtil.int4DotProductBothPacked` didn't exist when I wrote this, so I needed to unpack the value, but that shouldn't be necessary anymore.
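   For reference, something along these lines should work now (the signature is assumed from the method name; I haven't double-checked it):
   
   ```java
   // Assumed: both arguments are nibble-packed int4 vectors, so no unpacking is needed.
   int dot = VectorUtil.int4DotProductBothPacked(queryVector, targetVector);
   ```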



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene104/Lucene104ScalarQuantizedVectorsWriter.java:
##########
@@ -385,6 +389,63 @@ static DocsWithFieldSet writeVectorData(
     return docsWithField;
   }
 
+  static DocsWithFieldSet writeBinarizedVectorAndQueryData(
+      IndexOutput binarizedVectorData,
+      ScalarEncoding encoding,
+      IndexOutput binarizedQueryData,
+      ScalarEncoding queryEncoding,
+      FloatVectorValues floatVectorValues,
+      float[] centroid,
+      OptimizedScalarQuantizer binaryQuantizer)
+      throws IOException {
+    if (encoding == queryEncoding) {
+      throw new IllegalArgumentException("encoding and queryEncoding must be 
different");
+    }
+    if (encoding != ScalarEncoding.SINGLE_BIT || queryEncoding != 
ScalarEncoding.PACKED_NIBBLE) {
+      throw new IllegalArgumentException(
+          "encoding must be SINGLE_BIT and queryEncoding must be 
PACKED_NIBBLE");
+    }
+    DocsWithFieldSet docsWithField = new DocsWithFieldSet();
+    int discretizedDims =
+        Math.max(
+            encoding.getDiscreteDimensions(floatVectorValues.dimension()),
+            queryEncoding.getDiscreteDimensions(floatVectorValues.dimension()));
+    assert discretizedDims % encoding.getBits() == 0;
+    assert discretizedDims % queryEncoding.getBits() == 0;
+    byte[][] quantizationScratch = new byte[2][];
+    quantizationScratch[0] = new byte[discretizedDims];
+    quantizationScratch[1] = new byte[discretizedDims];
+    byte[] toIndex = new byte[encoding.getPackedLength(discretizedDims)];
+    byte[] toQuery = new byte[queryEncoding.getPackedLength(discretizedDims)];
+    KnnVectorValues.DocIndexIterator iterator = floatVectorValues.iterator();
+    for (int docV = iterator.nextDoc(); docV != NO_MORE_DOCS; docV = iterator.nextDoc()) {
+      // write index vector
+      OptimizedScalarQuantizer.QuantizationResult[] r =
+          binaryQuantizer.multiScalarQuantize(
+              floatVectorValues.vectorValue(iterator.index()),
+              quantizationScratch,
+              new byte[] {INDEX_BITS, QUERY_BITS},
+              centroid);
+      // pack and store document bit vector
+      packAsBinary(quantizationScratch[0], toIndex);
+      binarizedVectorData.writeBytes(toIndex, toIndex.length);
+      binarizedVectorData.writeInt(Float.floatToIntBits(r[0].lowerInterval()));
+      binarizedVectorData.writeInt(Float.floatToIntBits(r[0].upperInterval()));
+      binarizedVectorData.writeInt(Float.floatToIntBits(r[0].additionalCorrection()));
+      binarizedVectorData.writeInt(r[0].quantizedComponentSum());

Review Comment:
   This will result in two extra bytes per vector compared to the 102 codec, 
just checking that's fine (I personally don't have a problem with this).
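   For reference, the arithmetic as I read it (assuming the 102 codec writes `quantizedComponentSum` as a short, which is my recollection):
   
   ```java
   // Per-vector correction payload written above:
   //   3 floats (lower, upper, additionalCorrection) + 1 int (componentSum)   = 16 bytes
   // 102 codec layout (componentSum as a short, assumed):
   //   3 floats (lower, upper, additionalCorrection) + 1 short (componentSum) = 14 bytes
   ```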



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene104/Lucene104ScalarQuantizedVectorsWriter.java:
##########
@@ -190,22 +196,20 @@ private void writeVectors(
      FieldWriter fieldData, float[] clusterCenter, OptimizedScalarQuantizer scalarQuantizer)
       throws IOException {
     byte[] scratch =
-        new byte
-            [OptimizedScalarQuantizer.discretize(
-                fieldData.fieldInfo.getVectorDimension(), encoding.getDimensionsPerByte())];
+        new byte[encoding.getDiscreteDimensions(fieldData.fieldInfo.getVectorDimension())];
     byte[] vector =
         switch (encoding) {
-          case UNSIGNED_BYTE -> scratch;
-          case SEVEN_BIT -> scratch;
-          case PACKED_NIBBLE ->
-              new byte[encoding.getPackedLength(fieldData.fieldInfo.getVectorDimension())];
+          case UNSIGNED_BYTE, SEVEN_BIT -> scratch;

Review Comment:
   [not actionable] ooh nice, didn't realize you could combine cases like this.



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene104/Lucene104ScalarQuantizedVectorsWriter.java:
##########
@@ -190,22 +196,20 @@ private void writeVectors(
      FieldWriter fieldData, float[] clusterCenter, OptimizedScalarQuantizer scalarQuantizer)
       throws IOException {
     byte[] scratch =
-        new byte
-            [OptimizedScalarQuantizer.discretize(
-                fieldData.fieldInfo.getVectorDimension(), encoding.getDimensionsPerByte())];
+        new byte[encoding.getDiscreteDimensions(fieldData.fieldInfo.getVectorDimension())];
     byte[] vector =
         switch (encoding) {
-          case UNSIGNED_BYTE -> scratch;
-          case SEVEN_BIT -> scratch;
-          case PACKED_NIBBLE ->
-              new byte[encoding.getPackedLength(fieldData.fieldInfo.getVectorDimension())];
+          case UNSIGNED_BYTE, SEVEN_BIT -> scratch;
+          case PACKED_NIBBLE, SINGLE_BIT -> new byte[encoding.getPackedLength(scratch.length)];
         };
     for (int i = 0; i < fieldData.getVectors().size(); i++) {
       float[] v = fieldData.getVectors().get(i);
       OptimizedScalarQuantizer.QuantizationResult corrections =
          scalarQuantizer.scalarQuantize(v, scratch, encoding.getBits(), clusterCenter);
       if (encoding == ScalarEncoding.PACKED_NIBBLE) {
         OffHeapScalarQuantizedVectorValues.packNibbles(scratch, vector);
+      } else if (encoding == ScalarEncoding.SINGLE_BIT) {

Review Comment:
   might be worth converting this back into a switch, i made it an if statement 
initially because intellij complained that there weren't enough cases. this 
would help folks adding new encodings in the future.
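   Something like this, as a sketch (assuming the truncated SINGLE_BIT branch calls `packAsBinary`, per the writer hunk above):
   
   ```java
   // An exhaustive switch makes the compiler flag any new ScalarEncoding
   // value that's missing a packing branch.
   switch (encoding) {
     case UNSIGNED_BYTE, SEVEN_BIT -> {
       // vector aliases scratch here, so the quantized bytes are already in place
     }
     case PACKED_NIBBLE -> OffHeapScalarQuantizedVectorValues.packNibbles(scratch, vector);
     case SINGLE_BIT -> packAsBinary(scratch, vector);
   }
   ```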



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene104/Lucene104ScalarQuantizedVectorsWriter.java:
##########
@@ -251,22 +255,20 @@ private void writeSortedVectors(
       OptimizedScalarQuantizer scalarQuantizer)
       throws IOException {
     byte[] scratch =
-        new byte
-            [OptimizedScalarQuantizer.discretize(
-                fieldData.fieldInfo.getVectorDimension(), encoding.getDimensionsPerByte())];
+        new byte[encoding.getDiscreteDimensions(fieldData.fieldInfo.getVectorDimension())];
     byte[] vector =
         switch (encoding) {
-          case UNSIGNED_BYTE -> scratch;
-          case SEVEN_BIT -> scratch;
-          case PACKED_NIBBLE ->
-              new byte[encoding.getPackedLength(fieldData.fieldInfo.getVectorDimension())];
+          case UNSIGNED_BYTE, SEVEN_BIT -> scratch;
+          case PACKED_NIBBLE, SINGLE_BIT -> new byte[encoding.getPackedLength(scratch.length)];
         };
     for (int ordinal : ordMap) {
       float[] v = fieldData.getVectors().get(ordinal);
       OptimizedScalarQuantizer.QuantizationResult corrections =
          scalarQuantizer.scalarQuantize(v, scratch, encoding.getBits(), clusterCenter);
       if (encoding == ScalarEncoding.PACKED_NIBBLE) {
         OffHeapScalarQuantizedVectorValues.packNibbles(scratch, vector);
+      } else if (encoding == ScalarEncoding.SINGLE_BIT) {

Review Comment:
   Same here.


