Re: [PR] Enforce fallback support for float vector retrieval in quantized KNN vector formats [lucene]

via GitHub Tue, 09 Dec 2025 07:18:15 -0800


mikemccand commented on code in PR #15476:
URL: https://github.com/apache/lucene/pull/15476#discussion_r2603071729



##########
lucene/backward-codecs/src/test/org/apache/lucene/backward_codecs/lucene99/TestLucene99ScalarQuantizedVectorsFormat.java:
##########
@@ -407,4 +245,19 @@ public void testRandomWithUpdatesAndGraph() {
   public void testSearchWithVisitedLimit() {
     // search not supported
   }
+
+  @Override
+  protected boolean supportsFloatVectorFallback() {
+    return true;
+  }
+
+  @Override
+  protected int getQuantizationBits() {
+    return bits;
+  }
+
+  @Override
+  protected Codec getCodecForQuantizedTest() {

Review Comment:
   Should we also rename this `Quantized` -> `FloatVectorFallback` or so?



##########
lucene/backward-codecs/src/test/org/apache/lucene/backward_codecs/lucene102/TestLucene102BinaryQuantizedVectorsFormat.java:
##########
@@ -186,4 +186,9 @@ public void testQuantizedVectorsWriteAndRead() throws 
IOException {
       }
     }
   }
+
+  @Override
+  protected boolean supportsFloatVectorFallback() {

Review Comment:
   At first I thought "huh, maybe base class should just have default `return 
false;` impl", but then realized that's bad: we want new formats to have to 
explicitly think about this question "do I support regenerated float[] from 
compressed/quantized forms" rather than inherit dangerous default (which would 
risk losing the feature again).
   
   If Lucene had PQ (product quantization) working in a  new 
`KnnVectorsWriter`, which I think is both dimensionality reducing, and 
quantizing scalar values, it could in theory re-hydrate `float[]` I think?



##########
lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java:
##########
@@ -1910,6 +1929,163 @@ public void testVectorValuesReportCorrectDocs() throws 
Exception {
     }
   }
 
+  private List<float[]> getRandomFloatVector(int numVectors, int dim, boolean 
normalize) {
+    List<float[]> vectors = new ArrayList<>(numVectors);
+    for (int i = 0; i < numVectors; i++) {
+      float[] vec = randomVector(dim);
+      if (normalize) {
+        float[] copy = new float[vec.length];
+        System.arraycopy(vec, 0, copy, 0, copy.length);
+        VectorUtil.l2normalize(copy);
+        vec = copy;
+      }
+      vectors.add(vec);
+    }
+    return vectors;
+  }
+
+  /**
+   * Tests reading quantized vectors when raw vector data is empty. Verifies 
that scalar quantized
+   * formats can properly dequantize vectors and maintain accuracy within 
expected error bounds even
+   * when the original raw vector file is empty or corrupted.
+   */
+  public void testReadQuantizedVectorWithEmptyRawVectors() throws Exception {
+    assumeTrue("Test only applies to scalar quantized formats", 
supportsFloatVectorFallback());
+
+    String vectorFieldName = "vec1";
+    int numVectors = 1 + random().nextInt(50);
+    int dim = random().nextInt(64) + 1;
+    if (dim % 2 == 1) {
+      dim++;
+    }
+    float eps = (1f / (float) (1 << getQuantizationBits()));
+    VectorSimilarityFunction similarityFunction = randomSimilarity();
+    List<float[]> vectors =
+        getRandomFloatVector(
+            numVectors, dim, similarityFunction == 
VectorSimilarityFunction.COSINE);
+
+    try (BaseDirectoryWrapper dir = newDirectory();
+        IndexWriter w =
+            new IndexWriter(
+                dir,
+                new IndexWriterConfig()
+                    .setMaxBufferedDocs(numVectors + 1)
+                    .setRAMBufferSizeMB(IndexWriterConfig.DISABLE_AUTO_FLUSH)
+                    .setMergePolicy(NoMergePolicy.INSTANCE)
+                    .setUseCompoundFile(false)
+                    .setCodec(getCodecForQuantizedTest()))) {
+      dir.setCheckIndexOnClose(false);
+
+      for (int i = 0; i < numVectors; i++) {
+        Document doc = new Document();
+        doc.add(new KnnFloatVectorField(vectorFieldName, vectors.get(i), 
similarityFunction));
+        w.addDocument(doc);
+      }
+      w.commit();
+
+      simulateEmptyRawVectors(dir);
+
+      try (IndexReader reader = DirectoryReader.open(w)) {
+        LeafReader r = getOnlyLeafReader(reader);
+        if (r instanceof CodecReader codecReader) {
+          KnnVectorsReader knnVectorsReader = codecReader.getVectorReader();
+          knnVectorsReader = 
knnVectorsReader.unwrapReaderForField(vectorFieldName);
+          FloatVectorValues floatVectorValues =
+              knnVectorsReader.getFloatVectorValues(vectorFieldName);
+          if (floatVectorValues.size() > 0) {
+            KnnVectorValues.DocIndexIterator iter = 
floatVectorValues.iterator();
+            for (int docId = iter.nextDoc(); docId != NO_MORE_DOCS; docId = 
iter.nextDoc()) {
+              float[] dequantizedVector = 
floatVectorValues.vectorValue(iter.index());
+              float mae = 0;
+              for (int i = 0; i < dim; i++) {
+                mae += Math.abs(dequantizedVector[i] - vectors.get(docId)[i]);
+              }
+              mae /= dim;
+              assertTrue(
+                  "bits: " + getQuantizationBits() + " mae: " + mae + " > eps: 
" + eps, mae <= eps);
+            }
+          } else {
+            fail("floatVectorValues size should be non zero");
+          }
+        } else {
+          fail("reader is not CodecReader");
+        }
+      }
+    }
+  }
+
+  /** Simulates empty raw vectors by modifying index files. */
+  protected void simulateEmptyRawVectors(Directory dir) throws Exception {
+    final String[] indexFiles = dir.listAll();
+    final String RAW_VECTOR_EXTENSION = "vec";
+    final String VECTOR_META_EXTENSION = "vemf";
+
+    for (String file : indexFiles) {
+      if (file.endsWith("." + RAW_VECTOR_EXTENSION)) {
+        replaceWithEmptyVectorFile(dir, file);
+      } else if (file.endsWith("." + VECTOR_META_EXTENSION)) {
+        updateVectorMetadataFile(dir, file);
+      }
+    }
+  }
+
+  /** Replaces a raw vector file with an empty one that has valid 
header/footer. */
+  protected void replaceWithEmptyVectorFile(Directory dir, String fileName) 
throws Exception {
+    byte[] indexHeader;
+    try (IndexInput in = dir.openInput(fileName, IOContext.DEFAULT)) {
+      indexHeader = CodecUtil.readIndexHeader(in);
+    }
+    dir.deleteFile(fileName);
+    try (IndexOutput out = dir.createOutput(fileName, IOContext.DEFAULT)) {
+      // Write header
+      out.writeBytes(indexHeader, 0, indexHeader.length);
+      // Write footer (no content in between)
+      CodecUtil.writeFooter(out);
+    }
+  }
+
+  /** Updates vector metadata file to indicate zero vector length. */
+  protected void updateVectorMetadataFile(Directory dir, String fileName) 
throws Exception {

Review Comment:
   Hmm, can we move these methods down into the actual `FlatVectorsWriter`s 
impls that write this specific format?  These Codec specific details really 
shouldn't be in the abstract base class...
   
   Is the problem that we have two different `FlatVectorsFormat`s that share 
this same impl?  If so, maybe we can share somehow -- make `static` methods?



##########
lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java:
##########
@@ -117,12 +124,24 @@ public abstract class BaseKnnVectorsFormatTestCase 
extends BaseIndexFileFormatTe
   private VectorEncoding vectorEncoding;
   private VectorSimilarityFunction similarityFunction;
 
+  final int DIRECT_MONOTONIC_BLOCK_SHIFT = 16;
+
   @Before
   public void init() {
     vectorEncoding = randomVectorEncoding();
     similarityFunction = randomSimilarity();
   }
 
+  protected abstract boolean supportsFloatVectorFallback();
+
+  protected int getQuantizationBits() {

Review Comment:
   Add comment explaining that this is for computing epsilon tolerance of float 
quantization errors in the test case?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Enforce fallback support for float vector retrieval in quantized KNN vector formats [lucene]

Reply via email to