Re: [PR] Add BaseKnnVectorsFormatTestCase.testRecall() and fix old codecs [lucene]

via GitHub Wed, 16 Oct 2024 06:22:18 -0700


benwtrent commented on code in PR #13910:
URL: https://github.com/apache/lucene/pull/13910#discussion_r1803081052



##########
lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java:
##########
@@ -1906,4 +1916,122 @@ public void testMismatchedFields() throws Exception {
 
     IOUtils.close(reader, w2, dir1, dir2);
   }
+
+  /**
+   * Test that the query is a viable approximation to exact search. This test 
is designed to uncover
+   * gross failures only, not to represent the true expected recall.
+   */
+  public void testRecall() throws IOException {
+    VectorSimilarityFunction vectorSimilarityFunction = 
VectorSimilarityFunction.EUCLIDEAN;
+    int dim = 16;
+    try (Directory indexStore = getKnownIndexStore("field", dim, 
vectorSimilarityFunction);
+        IndexReader reader = DirectoryReader.open(indexStore)) {
+      IndexSearcher searcher = newSearcher(reader);
+      float[] queryEmbedding = new float[dim];
+      String queryString = "Apache License";
+      computeLineEmbedding(queryString, queryEmbedding);
+      // computeLineEmbedding("   END OF TERMS AND CONDITIONS", 
queryEmbedding);
+      // pass match-all "filter" to force full traversal, bypassing graph
+      KnnFloatVectorQuery exactQuery =
+          new KnnFloatVectorQuery("field", queryEmbedding, 1000, new 
MatchAllDocsQuery());
+      // indexed 421 lines from LICENSE.txt
+      // indexed 157 lines from NOTICE.txt
+      assertEquals(578, searcher.count(exactQuery)); // Same for exact search
+      KnnFloatVectorQuery query = new KnnFloatVectorQuery("field", 
queryEmbedding, 10);
+      assertEquals(10, searcher.count(query)); // Expect some results without 
timeout
+      TopDocs results = searcher.search(query, 10);
+      Set<Integer> resultDocs = new HashSet<>();
+      for (ScoreDoc scoreDoc : results.scoreDocs) {
+        /*
+        System.out.println(
+            "result " + i++ + ": " + 
reader.storedFields().document(scoreDoc.doc) + " " + scoreDoc);
+        */
+        resultDocs.add(scoreDoc.doc);
+      }
+      TopDocs expected = searcher.search(exactQuery, 10);
+      // int i = 0;
+      int recalled = 0;
+      for (ScoreDoc scoreDoc : expected.scoreDocs) {
+        /*
+        System.out.println(
+            "expected "
+                + i++
+                + ": "
+                + reader.storedFields().document(scoreDoc.doc)
+                + " "
+                + scoreDoc);
+        */
+        if (resultDocs.contains(scoreDoc.doc)) {
+          ++recalled;
+        }
+      }
+      assertTrue("recall should be at least 5/10, got " + recalled, recalled 
>= 5);

Review Comment:
   I would hope recall should be within some known parameter. It would be good 
to know if recall improved or worsened. Either case could show an unexpected 
change.



##########
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswVectorsReader.java:
##########
@@ -260,7 +260,7 @@ public void search(String field, float[] target, 
KnnCollector knnCollector, Bits
       int node = results.topNode();
       float minSimilarity = results.topScore();
       results.pop();
-      knnCollector.collect(node, minSimilarity);
+      knnCollector.collect(vectorValues.ordToDoc(node), minSimilarity);

Review Comment:
   Since Lucene90 didn't support sparse vector values, I am not sure this is 
strictly necessary. But I can understand it from a consistency standpoint.



##########
lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java:
##########
@@ -1906,4 +1916,122 @@ public void testMismatchedFields() throws Exception {
 
     IOUtils.close(reader, w2, dir1, dir2);
   }
+
+  /**
+   * Test that the query is a viable approximation to exact search. This test 
is designed to uncover
+   * gross failures only, not to represent the true expected recall.
+   */
+  public void testRecall() throws IOException {
+    VectorSimilarityFunction vectorSimilarityFunction = 
VectorSimilarityFunction.EUCLIDEAN;

Review Comment:
   It would be really neat if this went through each one. Quantization will do 
special things for different similarity functions and exercising each of those 
paths would be good.



##########
lucene/core/src/test/org/apache/lucene/codecs/lucene99/TestLucene99ScalarQuantizedVectorsFormat.java:
##########
@@ -102,6 +102,11 @@ public void testSearch() throws Exception {
     }
   }
 
+  @Override
+  public void testRecall() {
+    // ignore this test since this class always returns no results from search
+  }

Review Comment:
   I still think that the underlying flat format should allow `search` to be 
called and it simply iterates all matching vectors. But we can adjust that in a 
later PR.



##########
lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java:
##########
@@ -1906,4 +1916,122 @@ public void testMismatchedFields() throws Exception {
 
     IOUtils.close(reader, w2, dir1, dir2);
   }
+
+  /**
+   * Test that the query is a viable approximation to exact search. This test 
is designed to uncover
+   * gross failures only, not to represent the true expected recall.
+   */
+  public void testRecall() throws IOException {
+    VectorSimilarityFunction vectorSimilarityFunction = 
VectorSimilarityFunction.EUCLIDEAN;
+    int dim = 16;
+    try (Directory indexStore = getKnownIndexStore("field", dim, 
vectorSimilarityFunction);
+        IndexReader reader = DirectoryReader.open(indexStore)) {
+      IndexSearcher searcher = newSearcher(reader);
+      float[] queryEmbedding = new float[dim];
+      String queryString = "Apache License";
+      computeLineEmbedding(queryString, queryEmbedding);
+      // computeLineEmbedding("   END OF TERMS AND CONDITIONS", 
queryEmbedding);
+      // pass match-all "filter" to force full traversal, bypassing graph
+      KnnFloatVectorQuery exactQuery =
+          new KnnFloatVectorQuery("field", queryEmbedding, 1000, new 
MatchAllDocsQuery());
+      // indexed 421 lines from LICENSE.txt
+      // indexed 157 lines from NOTICE.txt

Review Comment:
   If we are adding two files like this, I wonder if we should simply take real 
vectors from a real embedding model and put them in the resources folder. 
   
   Maybe we can use glove or `sift`? Those are pretty small vectors, though 
only euclidean is expected (we would have to normalize for dot-product).
   
   My concern is that having such a simplistic vector with so few dimensions 
might not actually be useful.



##########
lucene/core/src/test/org/apache/lucene/index/TestKnnGraph.java:
##########
@@ -562,7 +562,7 @@ private void add(
     String idString = Integer.toString(id);
     doc.add(new StringField("id", idString, Field.Store.YES));
     doc.add(new SortedDocValuesField("id", new BytesRef(idString)));
-    // XSSystem.out.println("add " + idString + " " + Arrays.toString(vector));
+    // System.out.println("add " + idString + " " + Arrays.toString(vector));

Review Comment:
   Just delete the line?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Add BaseKnnVectorsFormatTestCase.testRecall() and fix old codecs [lucene]

Reply via email to