benwtrent commented on issue #13611: URL: https://github.com/apache/lucene/issues/13611#issuecomment-2252905328
@msokolov here is a good data set to test graph connectedness improvements. @david-sitsky looking at your data, there are multiple duplicates, which will form highly clustered regions within the HNSW graph and this will lead to disconnected components and difficulty searching. I downloaded your index and debugged the graph. It isn't fully connected from what I can tell, and the searcher seems to get stuck in local minima: ``` > Task :lucene:sandbox:TestLuceneVectorSearch.main() Leaf 0 has 3 layers Leaf 0 has 10832 documents Graph level=2 size=16, Fanout min=1, mean=3.63, max=8 % 0 10 20 30 40 50 60 70 80 90 100 0 1 2 2 2 3 3 4 5 6 8 Graph level=1 size=338, Fanout min=1, mean=5.81, max=16 % 0 10 20 30 40 50 60 70 80 90 100 0 1 1 2 3 5 6 8 10 14 16 Graph level=0 size=5179, Fanout min=1, mean=4.02, max=32 % 0 10 20 30 40 50 60 70 80 90 100 0 1 1 1 1 1 1 4 7 11 32 Graph level=2 size=16, connectedness=1.00 Graph level=1 size=338, connectedness=0.99 Graph level=0 size=5179, connectedness=0.96 Similarity query results: ``` With connecting disconnected components as @msokolov 's new PR work does: ``` Graph level=2 size=16, Fanout min=1, mean=3.63, max=8 % 0 10 20 30 40 50 60 70 80 90 100 0 1 2 2 2 3 3 4 5 6 8 Graph level=1 size=338, Fanout min=1, mean=5.81, max=16 % 0 10 20 30 40 50 60 70 80 90 100 0 1 1 2 3 5 6 8 10 14 16 Graph level=0 size=5179, Fanout min=1, mean=4.15, max=32 % 0 10 20 30 40 50 60 70 80 90 100 0 1 1 1 1 1 2 4 7 11 32 Graph level=2 size=16, connectedness=1.00 Graph level=1 size=338, connectedness=0.99 Graph level=0 size=5179, connectedness=1.00 ``` Now, even with this, we get stuck in local minima with `k` so small & default graph building params. So, I experimented with a couple (here is the code I used, note, I manually patched to use @msokolov 's connectedness improvements): ``` for (int m = 16; m <= 64; m += 16) { for (int b = 100; b <= 250; b+= 50) { // lets rebuild the index no nestedness // Delete the new Dir if it exists System.out.println("-----------------------------------"); System.out.println("Rebuilding the index with m=" + m + " and b=" + b); try (Directory newDir = FSDirectory.open(Paths.get("/path/to/different/dir"))) { try (IndexWriter writer = new IndexWriter(newDir, new IndexWriterConfig().setCodec( new Lucene99Codec() { @Override public KnnVectorsFormat getKnnVectorsFormatForField(String field) { return new Lucene99HnswVectorsFormat(16, 100); } } ).setOpenMode(IndexWriterConfig.OpenMode.CREATE))) { FloatVectorValues vectorValues = leafReader.getFloatVectorValues("image-embeddings"); while (vectorValues.nextDoc() != NO_MORE_DOCS) { Document document = new Document(); document.add(new KnnFloatVectorField("image-embeddings", vectorValues.vectorValue(), VectorSimilarityFunction.DOT_PRODUCT)); writer.addDocument(document); } writer.commit(); writer.forceMerge(1); } try (DirectoryReader newDirReader = DirectoryReader.open(newDir)) { IndexSearcher newSearcher = new IndexSearcher(newDirReader); LeafReaderContext newContext = newDirReader.leaves().get(0); LeafReader newLeaf = newContext.reader(); KnnVectorsReader newVectors = ((PerFieldKnnVectorsFormat.FieldsReader) ((CodecReader) newLeaf).getVectorReader()) .getFieldReader("image-embeddings"); HnswGraph newknnValues = ((Lucene99HnswVectorsReader) newVectors).getGraph("image-embeddings"); System.out.printf("New Leaf %d has %d layers\n", newContext.ord, knnValues.numLevels()); System.out.printf("New Leaf %d has %d documents\n", newContext.ord, newLeaf.maxDoc()); printGraphFanout(newknnValues, newLeaf.maxDoc()); printGraphConnectedNess(newknnValues); System.out.println("Rebuilt index results: \n"); for (int i = 1; i <= 15; i++) { System.out.println("Top 5 results with k=" + i * 5 + " "); topDocs = newSearcher.search(new KnnFloatVectorQuery("image-embeddings", queryVector, i * 5), 5); printResults(newSearcher, topDocs); } } } } } ``` Even with all this, I couldn't get the searcher to break out of the local minima. I wonder if our connectedness check accounts for the directional nature of the graph... I will debug further as I can. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org