benwtrent commented on code in PR #13910: URL: https://github.com/apache/lucene/pull/13910#discussion_r1803081052
########## lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java: ########## @@ -1906,4 +1916,122 @@ public void testMismatchedFields() throws Exception { IOUtils.close(reader, w2, dir1, dir2); } + + /** + * Test that the query is a viable approximation to exact search. This test is designed to uncover + * gross failures only, not to represent the true expected recall. + */ + public void testRecall() throws IOException { + VectorSimilarityFunction vectorSimilarityFunction = VectorSimilarityFunction.EUCLIDEAN; + int dim = 16; + try (Directory indexStore = getKnownIndexStore("field", dim, vectorSimilarityFunction); + IndexReader reader = DirectoryReader.open(indexStore)) { + IndexSearcher searcher = newSearcher(reader); + float[] queryEmbedding = new float[dim]; + String queryString = "Apache License"; + computeLineEmbedding(queryString, queryEmbedding); + // computeLineEmbedding(" END OF TERMS AND CONDITIONS", queryEmbedding); + // pass match-all "filter" to force full traversal, bypassing graph + KnnFloatVectorQuery exactQuery = + new KnnFloatVectorQuery("field", queryEmbedding, 1000, new MatchAllDocsQuery()); + // indexed 421 lines from LICENSE.txt + // indexed 157 lines from NOTICE.txt + assertEquals(578, searcher.count(exactQuery)); // Same for exact search + KnnFloatVectorQuery query = new KnnFloatVectorQuery("field", queryEmbedding, 10); + assertEquals(10, searcher.count(query)); // Expect some results without timeout + TopDocs results = searcher.search(query, 10); + Set<Integer> resultDocs = new HashSet<>(); + for (ScoreDoc scoreDoc : results.scoreDocs) { + /* + System.out.println( + "result " + i++ + ": " + reader.storedFields().document(scoreDoc.doc) + " " + scoreDoc); + */ + resultDocs.add(scoreDoc.doc); + } + TopDocs expected = searcher.search(exactQuery, 10); + // int i = 0; + int recalled = 0; + for (ScoreDoc scoreDoc : expected.scoreDocs) { + /* + System.out.println( + "expected " + + i++ + + ": " + + reader.storedFields().document(scoreDoc.doc) + + " " + + scoreDoc); + */ + if (resultDocs.contains(scoreDoc.doc)) { + ++recalled; + } + } + assertTrue("recall should be at least 5/10, got " + recalled, recalled >= 5); Review Comment: I would hope recall should be within some known parameter. It would be good to know if recall improved or worsened. Either case could show an unexpected change. ########## lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswVectorsReader.java: ########## @@ -260,7 +260,7 @@ public void search(String field, float[] target, KnnCollector knnCollector, Bits int node = results.topNode(); float minSimilarity = results.topScore(); results.pop(); - knnCollector.collect(node, minSimilarity); + knnCollector.collect(vectorValues.ordToDoc(node), minSimilarity); Review Comment: Since Lucene90 didn't support sparse vector values, I am not sure this is strictly necessary. But I can understand it from a consistency standpoint. ########## lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java: ########## @@ -1906,4 +1916,122 @@ public void testMismatchedFields() throws Exception { IOUtils.close(reader, w2, dir1, dir2); } + + /** + * Test that the query is a viable approximation to exact search. This test is designed to uncover + * gross failures only, not to represent the true expected recall. + */ + public void testRecall() throws IOException { + VectorSimilarityFunction vectorSimilarityFunction = VectorSimilarityFunction.EUCLIDEAN; Review Comment: It would be really neat if this went through each one. Quantization will do special things for different similarity functions and exercising each of those paths would be good. ########## lucene/core/src/test/org/apache/lucene/codecs/lucene99/TestLucene99ScalarQuantizedVectorsFormat.java: ########## @@ -102,6 +102,11 @@ public void testSearch() throws Exception { } } + @Override + public void testRecall() { + // ignore this test since this class always returns no results from search + } Review Comment: I still think that the underlying flat format should allow `search` to be called and it simply iterates all matching vectors. But we can adjust that in a later PR. ########## lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java: ########## @@ -1906,4 +1916,122 @@ public void testMismatchedFields() throws Exception { IOUtils.close(reader, w2, dir1, dir2); } + + /** + * Test that the query is a viable approximation to exact search. This test is designed to uncover + * gross failures only, not to represent the true expected recall. + */ + public void testRecall() throws IOException { + VectorSimilarityFunction vectorSimilarityFunction = VectorSimilarityFunction.EUCLIDEAN; + int dim = 16; + try (Directory indexStore = getKnownIndexStore("field", dim, vectorSimilarityFunction); + IndexReader reader = DirectoryReader.open(indexStore)) { + IndexSearcher searcher = newSearcher(reader); + float[] queryEmbedding = new float[dim]; + String queryString = "Apache License"; + computeLineEmbedding(queryString, queryEmbedding); + // computeLineEmbedding(" END OF TERMS AND CONDITIONS", queryEmbedding); + // pass match-all "filter" to force full traversal, bypassing graph + KnnFloatVectorQuery exactQuery = + new KnnFloatVectorQuery("field", queryEmbedding, 1000, new MatchAllDocsQuery()); + // indexed 421 lines from LICENSE.txt + // indexed 157 lines from NOTICE.txt Review Comment: If we are adding two files like this, I wonder if we should simply take real vectors from a real embedding model and put them in the resources folder. Maybe we can use glove or `sift`? Those are pretty small vectors, though only euclidean is expected (we would have to normalize for dot-product). My concern is that having such a simplistic vector with so few dimensions might not actually be useful. ########## lucene/core/src/test/org/apache/lucene/index/TestKnnGraph.java: ########## @@ -562,7 +562,7 @@ private void add( String idString = Integer.toString(id); doc.add(new StringField("id", idString, Field.Store.YES)); doc.add(new SortedDocValuesField("id", new BytesRef(idString))); - // XSSystem.out.println("add " + idString + " " + Arrays.toString(vector)); + // System.out.println("add " + idString + " " + Arrays.toString(vector)); Review Comment: Just delete the line? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org