Re: [PR] Enforce fallback support for float vector retrieval in quantized KNN vector formats [lucene]

via GitHub Fri, 12 Dec 2025 18:20:35 -0800


mikemccand commented on code in PR #15476:
URL: https://github.com/apache/lucene/pull/15476#discussion_r2615995352



##########
lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java:
##########
@@ -1910,6 +1929,163 @@ public void testVectorValuesReportCorrectDocs() throws Exception {
     }
   }
 
+  private List<float[]> getRandomFloatVector(int numVectors, int dim, boolean normalize) {
+    List<float[]> vectors = new ArrayList<>(numVectors);
+    for (int i = 0; i < numVectors; i++) {
+      float[] vec = randomVector(dim);
+      if (normalize) {
+        float[] copy = new float[vec.length];
+        System.arraycopy(vec, 0, copy, 0, copy.length);
+        VectorUtil.l2normalize(copy);
+        vec = copy;
+      }
+      vectors.add(vec);
+    }
+    return vectors;
+  }
+
+  /**
+   * Tests reading quantized vectors when raw vector data is empty. Verifies that scalar quantized
+   * formats can properly dequantize vectors and maintain accuracy within expected error bounds even
+   * when the original raw vector file is empty or corrupted.
+   */
+  public void testReadQuantizedVectorWithEmptyRawVectors() throws Exception {
+    assumeTrue("Test only applies to scalar quantized formats", supportsFloatVectorFallback());
+
+    String vectorFieldName = "vec1";
+    int numVectors = 1 + random().nextInt(50);
+    int dim = random().nextInt(64) + 1;
+    if (dim % 2 == 1) {
+      dim++;
+    }
+    float eps = (1f / (float) (1 << getQuantizationBits()));
+    VectorSimilarityFunction similarityFunction = randomSimilarity();
+    List<float[]> vectors =
+        getRandomFloatVector(
+            numVectors, dim, similarityFunction == VectorSimilarityFunction.COSINE);
+
+    try (BaseDirectoryWrapper dir = newDirectory();
+        IndexWriter w =
+            new IndexWriter(
+                dir,
+                new IndexWriterConfig()
+                    .setMaxBufferedDocs(numVectors + 1)
+                    .setRAMBufferSizeMB(IndexWriterConfig.DISABLE_AUTO_FLUSH)
+                    .setMergePolicy(NoMergePolicy.INSTANCE)
+                    .setUseCompoundFile(false)
+                    .setCodec(getCodecForQuantizedTest()))) {
+      dir.setCheckIndexOnClose(false);
+
+      for (int i = 0; i < numVectors; i++) {
+        Document doc = new Document();
+        doc.add(new KnnFloatVectorField(vectorFieldName, vectors.get(i), similarityFunction));
+        w.addDocument(doc);
+      }
+      w.commit();
+
+      simulateEmptyRawVectors(dir);
+
+      try (IndexReader reader = DirectoryReader.open(w)) {
+        LeafReader r = getOnlyLeafReader(reader);
+        if (r instanceof CodecReader codecReader) {
+          KnnVectorsReader knnVectorsReader = codecReader.getVectorReader();
+          knnVectorsReader = knnVectorsReader.unwrapReaderForField(vectorFieldName);
+          FloatVectorValues floatVectorValues =
+              knnVectorsReader.getFloatVectorValues(vectorFieldName);
+          if (floatVectorValues.size() > 0) {
+            KnnVectorValues.DocIndexIterator iter = floatVectorValues.iterator();
+            for (int docId = iter.nextDoc(); docId != NO_MORE_DOCS; docId = iter.nextDoc()) {
+              float[] dequantizedVector = floatVectorValues.vectorValue(iter.index());
+              float mae = 0;
+              for (int i = 0; i < dim; i++) {
+                mae += Math.abs(dequantizedVector[i] - vectors.get(docId)[i]);
+              }
+              mae /= dim;
+              assertTrue(
+                  "bits: " + getQuantizationBits() + " mae: " + mae + " > eps: " + eps, mae <= eps);
+            }
+          } else {
+            fail("floatVectorValues size should be non zero");
+          }
+        } else {
+          fail("reader is not CodecReader");
+        }
+      }
+    }
+  }
+
+  /** Simulates empty raw vectors by modifying index files. */
+  protected void simulateEmptyRawVectors(Directory dir) throws Exception {
+    final String[] indexFiles = dir.listAll();
+    final String RAW_VECTOR_EXTENSION = "vec";
+    final String VECTOR_META_EXTENSION = "vemf";
+
+    for (String file : indexFiles) {
+      if (file.endsWith("." + RAW_VECTOR_EXTENSION)) {
+        replaceWithEmptyVectorFile(dir, file);
+      } else if (file.endsWith("." + VECTOR_META_EXTENSION)) {
+        updateVectorMetadataFile(dir, file);
+      }
+    }
+  }
+
+  /** Replaces a raw vector file with an empty one that has valid header/footer. */
+  protected void replaceWithEmptyVectorFile(Directory dir, String fileName) throws Exception {
+    byte[] indexHeader;
+    try (IndexInput in = dir.openInput(fileName, IOContext.DEFAULT)) {
+      indexHeader = CodecUtil.readIndexHeader(in);
+    }
+    dir.deleteFile(fileName);
+    try (IndexOutput out = dir.createOutput(fileName, IOContext.DEFAULT)) {
+      // Write header
+      out.writeBytes(indexHeader, 0, indexHeader.length);
+      // Write footer (no content in between)
+      CodecUtil.writeFooter(out);
+    }
+  }
+
+  /** Updates vector metadata file to indicate zero vector length. */
+  protected void updateVectorMetadataFile(Directory dir, String fileName) throws Exception {

Review Comment:
   Oh sorry, I meant move these methods down into the test classes that 
correspond to the format you are tweaking, i.e. `TestXXVectorsFormat` classes.  
I think specifically to `TestLucene99HnswVectorsFormat.java`?
   
   My logic is that how the subclasses of this test class go and truncate the
full-precision vector files to empty is a Codec-implementation-dependent thing,
whereas this `Base` class should be agnostic to such Codec specifics.  Multiple
Codecs should be able to re-use this class.  E.g.
`TestSimpleTextKnnVectorsFormat` would need to implement its own way to zero
out the vector files?  Maybe we could add that here/now?  This way it exercises
that this base test class is in fact generic enough to test across codecs...
(and this is sort of the purpose of the `SimpleTextCodec` -- to show that the
Codec API really is generic enough to correctly support a wildly different kind
of Codec implementation.  Plus it's a super cool way to debug indexing problems
since you can go and look at the text files.)
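
A minimal sketch of the split described above, in plain Java with an in-memory map standing in for a Lucene `Directory`. Everything here is hypothetical and for illustration only: the class names, the `.quant` and `.txtvec` extensions, and the `run` helper are assumptions, not the PR's code or Lucene's API. It only shows a codec-agnostic base class that knows raw vectors must be emptied, with each format-specific test supplying how.

```java
import java.util.HashMap;
import java.util.Map;

/** Codec-agnostic base: knows THAT raw vectors must be emptied, not HOW. */
abstract class BaseFormatTestSketch {
  /** Hook each format-specific test overrides (name mirrors the PR's helper). */
  protected abstract void simulateEmptyRawVectors(Map<String, byte[]> dir);

  /** Shared flow: copy the written files, empty the raw vectors, return the result. */
  final Map<String, byte[]> run(Map<String, byte[]> written) {
    Map<String, byte[]> dir = new HashMap<>(written);
    simulateEmptyRawVectors(dir);
    return dir;
  }
}

/** Stand-in for a binary format test: raw vectors live in "*.vec" files. */
class BinaryFormatTestSketch extends BaseFormatTestSketch {
  @Override
  protected void simulateEmptyRawVectors(Map<String, byte[]> dir) {
    dir.replaceAll((name, bytes) -> name.endsWith(".vec") ? new byte[0] : bytes);
  }
}

/** Stand-in for a text format test: a made-up extension with its own layout. */
class TextFormatTestSketch extends BaseFormatTestSketch {
  @Override
  protected void simulateEmptyRawVectors(Map<String, byte[]> dir) {
    dir.replaceAll((name, bytes) -> name.endsWith(".txtvec") ? new byte[0] : bytes);
  }
}

class FallbackHookDemo {
  public static void main(String[] args) {
    Map<String, byte[]> written = new HashMap<>();
    written.put("_0.vec", new byte[] {1, 2, 3}); // raw vectors
    written.put("_0.quant", new byte[] {4, 5}); // quantized data (made-up name)

    Map<String, byte[]> after = new BinaryFormatTestSketch().run(written);
    System.out.println("raw bytes left: " + after.get("_0.vec").length);
    System.out.println("quantized bytes left: " + after.get("_0.quant").length);
  }
}
```

The base class never mentions a file extension, so a second format only has to override the one hook; that is the property the comment above asks the real `Base` test class to keep.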
   
   (Separately: we don't seem to have `BaseFlatVectorsFormatTestCase` nor 
separate test cases for each flat format?  Does `BaseKnnVectorsFormatTestCase` 
test the `FlatVectorsFormat` too maybe?  Anyway, let's not try to solve that 
here -- PNP!).
   
   > Alternatively, we could make this part of the codec itself to create empty 
vector files during flush, but this functionality isn't currently supported and 
might be too invasive a change for this particular fix.
   
   Yeah let's not try to do that here -- that indeed belongs under #13158 -- I 
think having Codec write/own (inventory) the empty vector files seems like a 
good approach, but it's likely complicated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


