Re: [PR] Add BaseKnnVectorsFormatTestCase.testRecall() and fix old codecs [lucene]

via GitHub Wed, 16 Oct 2024 06:34:29 -0700


msokolov commented on code in PR #13910:
URL: https://github.com/apache/lucene/pull/13910#discussion_r1803114051



##########
lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java:
##########
@@ -1906,4 +1916,122 @@ public void testMismatchedFields() throws Exception {
 
     IOUtils.close(reader, w2, dir1, dir2);
   }
+
+  /**
+   * Test that the query is a viable approximation to exact search. This test 
is designed to uncover
+   * gross failures only, not to represent the true expected recall.
+   */
+  public void testRecall() throws IOException {
+    VectorSimilarityFunction vectorSimilarityFunction = 
VectorSimilarityFunction.EUCLIDEAN;
+    int dim = 16;
+    try (Directory indexStore = getKnownIndexStore("field", dim, 
vectorSimilarityFunction);
+        IndexReader reader = DirectoryReader.open(indexStore)) {
+      IndexSearcher searcher = newSearcher(reader);
+      float[] queryEmbedding = new float[dim];
+      String queryString = "Apache License";
+      computeLineEmbedding(queryString, queryEmbedding);
+      // computeLineEmbedding("   END OF TERMS AND CONDITIONS", 
queryEmbedding);
+      // pass match-all "filter" to force full traversal, bypassing graph
+      KnnFloatVectorQuery exactQuery =
+          new KnnFloatVectorQuery("field", queryEmbedding, 1000, new 
MatchAllDocsQuery());
+      // indexed 421 lines from LICENSE.txt
+      // indexed 157 lines from NOTICE.txt

Review Comment:
   lemme see if I can find bugs with the "dictionary" PR? We can always beef 
this up with more realistic data 



##########
lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java:
##########
@@ -1906,4 +1916,122 @@ public void testMismatchedFields() throws Exception {
 
     IOUtils.close(reader, w2, dir1, dir2);
   }
+
+  /**
+   * Test that the query is a viable approximation to exact search. This test 
is designed to uncover
+   * gross failures only, not to represent the true expected recall.
+   */
+  public void testRecall() throws IOException {
+    VectorSimilarityFunction vectorSimilarityFunction = 
VectorSimilarityFunction.EUCLIDEAN;
+    int dim = 16;
+    try (Directory indexStore = getKnownIndexStore("field", dim, 
vectorSimilarityFunction);
+        IndexReader reader = DirectoryReader.open(indexStore)) {
+      IndexSearcher searcher = newSearcher(reader);
+      float[] queryEmbedding = new float[dim];
+      String queryString = "Apache License";
+      computeLineEmbedding(queryString, queryEmbedding);
+      // computeLineEmbedding("   END OF TERMS AND CONDITIONS", 
queryEmbedding);
+      // pass match-all "filter" to force full traversal, bypassing graph
+      KnnFloatVectorQuery exactQuery =
+          new KnnFloatVectorQuery("field", queryEmbedding, 1000, new 
MatchAllDocsQuery());
+      // indexed 421 lines from LICENSE.txt
+      // indexed 157 lines from NOTICE.txt
+      assertEquals(578, searcher.count(exactQuery)); // Same for exact search
+      KnnFloatVectorQuery query = new KnnFloatVectorQuery("field", 
queryEmbedding, 10);
+      assertEquals(10, searcher.count(query)); // Expect some results without 
timeout
+      TopDocs results = searcher.search(query, 10);
+      Set<Integer> resultDocs = new HashSet<>();
+      for (ScoreDoc scoreDoc : results.scoreDocs) {
+        /*
+        System.out.println(
+            "result " + i++ + ": " + 
reader.storedFields().document(scoreDoc.doc) + " " + scoreDoc);
+        */
+        resultDocs.add(scoreDoc.doc);
+      }
+      TopDocs expected = searcher.search(exactQuery, 10);
+      // int i = 0;
+      int recalled = 0;
+      for (ScoreDoc scoreDoc : expected.scoreDocs) {
+        /*
+        System.out.println(
+            "expected "
+                + i++
+                + ": "
+                + reader.storedFields().document(scoreDoc.doc)
+                + " "
+                + scoreDoc);
+        */
+        if (resultDocs.contains(scoreDoc.doc)) {
+          ++recalled;
+        }
+      }
+      assertTrue("recall should be at least 5/10, got " + recalled, recalled 
>= 5);

Review Comment:
   I think it could be tricky to be very precise given the range of codec 
options. I guess we could specialize per codec?



##########
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswVectorsReader.java:
##########
@@ -260,7 +260,7 @@ public void search(String field, float[] target, 
KnnCollector knnCollector, Bits
       int node = results.topNode();
       float minSimilarity = results.topScore();
       results.pop();
-      knnCollector.collect(node, minSimilarity);
+      knnCollector.collect(vectorValues.ordToDoc(node), minSimilarity);

Review Comment:
   Oh, that's a relief! I couldn't remember if we had that or not. At any rate 
it is possible to create a sparse 90 index in tests now.



##########
lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java:
##########
@@ -1906,4 +1916,122 @@ public void testMismatchedFields() throws Exception {
 
     IOUtils.close(reader, w2, dir1, dir2);
   }
+
+  /**
+   * Test that the query is a viable approximation to exact search. This test 
is designed to uncover
+   * gross failures only, not to represent the true expected recall.
+   */
+  public void testRecall() throws IOException {
+    VectorSimilarityFunction vectorSimilarityFunction = 
VectorSimilarityFunction.EUCLIDEAN;

Review Comment:
   agreed, we want to test all the things



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Add BaseKnnVectorsFormatTestCase.testRecall() and fix old codecs [lucene]

Reply via email to