benwtrent commented on issue #13611:
URL: https://github.com/apache/lucene/issues/13611#issuecomment-2252905328

   @msokolov here is a good data set to test graph connectedness improvements. 
   
   @david-sitsky looking at your data, there are multiple duplicates, which 
will form highly clustered regions within the HNSW graph and this will lead to 
disconnected components and difficulty searching.
   
   I downloaded your index and debugged the graph. It isn't fully connected 
from what I can tell, and the searcher seems to get stuck in local minima:
   
   ```
   > Task :lucene:sandbox:TestLuceneVectorSearch.main()
   Leaf 0 has 3 layers
   Leaf 0 has 10832 documents
   Graph level=2 size=16, Fanout min=1, mean=3.63, max=8
   %   0  10  20  30  40  50  60  70  80  90 100
       0   1   2   2   2   3   3   4   5   6   8
   Graph level=1 size=338, Fanout min=1, mean=5.81, max=16
   %   0  10  20  30  40  50  60  70  80  90 100
       0   1   1   2   3   5   6   8  10  14  16
   Graph level=0 size=5179, Fanout min=1, mean=4.02, max=32
   %   0  10  20  30  40  50  60  70  80  90 100
       0   1   1   1   1   1   1   4   7  11  32
   Graph level=2 size=16, connectedness=1.00
   Graph level=1 size=338, connectedness=0.99
   Graph level=0 size=5179, connectedness=0.96
   Similarity query results: 
   ```
   
   With connecting disconnected components as @msokolov 's new PR work does:
   
   ```
   Graph level=2 size=16, Fanout min=1, mean=3.63, max=8
   %   0  10  20  30  40  50  60  70  80  90 100
       0   1   2   2   2   3   3   4   5   6   8
   Graph level=1 size=338, Fanout min=1, mean=5.81, max=16
   %   0  10  20  30  40  50  60  70  80  90 100
       0   1   1   2   3   5   6   8  10  14  16
   Graph level=0 size=5179, Fanout min=1, mean=4.15, max=32
   %   0  10  20  30  40  50  60  70  80  90 100
       0   1   1   1   1   1   2   4   7  11  32
   Graph level=2 size=16, connectedness=1.00
   Graph level=1 size=338, connectedness=0.99
   Graph level=0 size=5179, connectedness=1.00
   ```
   
   Now, even with this, we get stuck in local minima with `k` so small & 
default graph building params. 
   
   So, I experimented with a couple (here is the code I used, note, I manually 
patched to use @msokolov 's connectedness improvements):
   ```
   for (int m = 16; m <= 64; m += 16) {
           for (int b = 100; b <= 250; b+= 50) {
             // lets rebuild the index no nestedness
             // Delete the new Dir if it exists
             System.out.println("-----------------------------------");
             System.out.println("Rebuilding the index with m=" + m + " and b=" 
+ b);
             try (Directory newDir =
                    FSDirectory.open(Paths.get("/path/to/different/dir"))) {
               try (IndexWriter writer = new IndexWriter(newDir, new 
IndexWriterConfig().setCodec(
                 new Lucene99Codec() {
                   @Override
                   public KnnVectorsFormat getKnnVectorsFormatForField(String 
field) {
                     return new Lucene99HnswVectorsFormat(16, 100);
                   }
                 }
               ).setOpenMode(IndexWriterConfig.OpenMode.CREATE))) {
                 FloatVectorValues vectorValues = 
leafReader.getFloatVectorValues("image-embeddings");
                 while (vectorValues.nextDoc() != NO_MORE_DOCS) {
                   Document document = new Document();
                   document.add(new KnnFloatVectorField("image-embeddings", 
vectorValues.vectorValue(), VectorSimilarityFunction.DOT_PRODUCT));
                   writer.addDocument(document);
                 }
                 writer.commit();
                 writer.forceMerge(1);
               }
               try (DirectoryReader newDirReader = 
DirectoryReader.open(newDir)) {
                 IndexSearcher newSearcher = new IndexSearcher(newDirReader);
                 LeafReaderContext newContext = newDirReader.leaves().get(0);
   
                 LeafReader newLeaf = newContext.reader();
                 KnnVectorsReader newVectors =
                   ((PerFieldKnnVectorsFormat.FieldsReader) ((CodecReader) 
newLeaf).getVectorReader())
                     .getFieldReader("image-embeddings");
                 HnswGraph newknnValues = ((Lucene99HnswVectorsReader) 
newVectors).getGraph("image-embeddings");
                 System.out.printf("New Leaf %d has %d layers\n", 
newContext.ord, knnValues.numLevels());
                 System.out.printf("New Leaf %d has %d documents\n", 
newContext.ord, newLeaf.maxDoc());
                 printGraphFanout(newknnValues, newLeaf.maxDoc());
                 printGraphConnectedNess(newknnValues);
                 System.out.println("Rebuilt index results: \n");
                 for (int i = 1; i <= 15; i++) {
                   System.out.println("Top 5 results with k=" + i * 5 + " ");
                   topDocs = newSearcher.search(new 
KnnFloatVectorQuery("image-embeddings", queryVector, i * 5), 5);
                   printResults(newSearcher, topDocs);
                 }
               }
             }
           }
         }
   ```
   
   Even with all this, I couldn't get the searcher to break out of the local 
minima. I wonder if our connectedness check accounts for the directional nature 
of the graph... I will debug further as I can.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to