mayya-sharipova commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r751687763



##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -56,31 +59,50 @@
 public final class HnswGraph extends KnnGraphValues {
 
   private final int maxConn;
+  private int numLevels; // the current number of levels in the graph
+  private int entryNode; // the current graph entry node on the top level
 
-  // Each entry lists the top maxConn neighbors of a node. The nodes correspond to vectors added to
-  // HnswBuilder, and the
-  // node values are the ordinals of those vectors.
-  private final List<NeighborArray> graph;
+  // Nodes by level expressed as the level 0's nodes' ordinals.
+  // As level 0 contains all nodes, nodesByLevel.get(0) is null.
+  private final List<int[]> nodesByLevel;
+
+  // graph is a list of graph levels.
+  // Each level is represented as List<NeighborArray> – nodes' connections on this level.
+  // Each entry in the list has the top maxConn neighbors of a node. The nodes correspond to vectors
+  // added to HnswBuilder, and the node values are the ordinals of those vectors.
+  // Thus, on all levels, neighbors are expressed as the level 0's nodes' ordinals.
+  private final List<List<NeighborArray>> graph;
 
   // KnnGraphValues iterator members
   private int upto;
   private NeighborArray cur;
 
-  HnswGraph(int maxConn) {
-    graph = new ArrayList<>();
-    // Typically with diversity criteria we see nodes not fully occupied; average fanout seems to be
-    // about 1/2 maxConn. There is some indexing time penalty for under-allocating, but saves RAM
-    graph.add(new NeighborArray(Math.max(32, maxConn / 4)));
+  HnswGraph(int maxConn, int levelOfFirstNode) {
     this.maxConn = maxConn;
+    this.numLevels = levelOfFirstNode + 1;
+    this.graph = new ArrayList<>(numLevels);
+    this.entryNode = 0;
+    for (int i = 0; i < numLevels; i++) {
+      graph.add(new ArrayList<>());
+      // Typically with diversity criteria we see nodes not fully occupied;
+      // average fanout seems to be about 1/2 maxConn.
+      // There is some indexing time penalty for under-allocating, but saves RAM
+      graph.get(i).add(new NeighborArray(Math.max(32, maxConn / 4)));
+    }
+
+    this.nodesByLevel = new ArrayList<>(numLevels);
+    nodesByLevel.add(null); // we don't need this for 0th level, as it contains all nodes
+    for (int l = 1; l < numLevels; l++) {
+      nodesByLevel.add(new int[] {0});
+    }
   }
 
   /**
-   * Searches for the nearest neighbors of a query vector.
+   * Searches HNSW graph for the nearest neighbors of a query vector.
    *
    * @param query search query vector
    * @param topK the number of nodes to be returned
-   * @param numSeed the size of the queue maintained while searching, and controls the number of
-   *     random entry points to sample
+   * @param numSeed the size of the queue maintained while searching
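
As a rough illustration of the layout this diff describes (the class name `HnswLayoutSketch` and the plain `int[]` neighbor lists below are invented stand-ins, not Lucene's `NeighborArray` or `HnswGraph`), here is a small standalone sketch: upper levels keep only a subset of nodes, and both level membership and neighbor lists are expressed in level-0 ordinals, so no remapping is needed when descending levels.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the layout described in the diff; plain int[] neighbor lists
// stand in for Lucene's NeighborArray.
class HnswLayoutSketch {
  // nodesByLevel.get(level) lists the nodes present on that level as level-0
  // ordinals; slot 0 is null because level 0 implicitly contains every node.
  final List<int[]> nodesByLevel = new ArrayList<>();

  // graph.get(level) holds one neighbor list per node on that level, again in
  // level-0 ordinals. On level 0 the i-th entry belongs to node i; on upper
  // levels the i-th entry belongs to nodesByLevel.get(level)[i].
  final List<List<int[]>> graph = new ArrayList<>();

  public static void main(String[] args) {
    HnswLayoutSketch g = new HnswLayoutSketch();

    // Level 0: all four nodes, each with a few neighbors.
    g.nodesByLevel.add(null);
    g.graph.add(List.of(new int[] {1, 2}, new int[] {0, 3}, new int[] {0}, new int[] {1}));

    // Level 1: only nodes 0 and 3 exist here; their connections still use
    // level-0 ordinals, so no remapping is needed when a search descends.
    g.nodesByLevel.add(new int[] {0, 3});
    g.graph.add(List.of(new int[] {3}, new int[] {0}));

    System.out.println("nodes on level 1: " + Arrays.toString(g.nodesByLevel.get(1)));
    System.out.println(
        "neighbors of node 3 on level 1: " + Arrays.toString(g.graph.get(1).get(1)));
  }
}
```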

Review comment:
       @jtibshirani Thanks for the comment. I have thought about this for a long time, and I think we should keep it until we have a better strategy.
   
   Currently, we search HNSW graphs through `KnnVectorQuery`, providing k as `numCandidates` (`numSeed` or `fanout`), because this query doesn't support a separate `numCandidates` parameter. I think it is not optimal that instead of returning, say, the top 10 docs from every segment (or shard in a distributed search), we return 100 (200, 800, etc.).
   
   Also, the signatures of `KnnVectorQuery` and `HnswGraph::search` need to be updated to reflect the fact that `k` is indeed the number of candidates we are considering.
   
   Maybe we can discuss outside this PR whether and where it makes sense to have `numCandidates`.
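   
   For concreteness, here is a tiny standalone sketch of the distinction (not Lucene code: the class, the brute-force scoring loop, and the parameter names are made up for illustration). It shows `k` as the number of results handed back to the caller, while a `numCandidates`-style knob only bounds the queue explored during the search:
   
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Toy example, not Lucene code: k is how many results the caller gets back,
// numCandidates only bounds how many candidates the search keeps in its queue.
class KVersusNumCandidates {
  static int[] search(float[] scores, int k, int numCandidates) {
    // Min-heap by score, capped at numCandidates: the "queue maintained while
    // searching" that numSeed currently sizes.
    PriorityQueue<Integer> queue =
        new PriorityQueue<>(Comparator.comparingDouble((Integer i) -> (double) scores[i]));
    for (int node = 0; node < scores.length; node++) {
      queue.add(node);
      if (queue.size() > numCandidates) {
        queue.poll(); // drop the worst-scored candidate
      }
    }
    // Only the k best candidates are handed back to the caller.
    List<Integer> best = new ArrayList<>(queue);
    best.sort(Comparator.comparingDouble((Integer i) -> (double) scores[i]).reversed());
    return best.subList(0, Math.min(k, best.size())).stream()
        .mapToInt(Integer::intValue)
        .toArray();
  }

  public static void main(String[] args) {
    float[] scores = {0.1f, 0.9f, 0.4f, 0.8f, 0.3f};
    // Explore up to 4 candidates internally, but return only the top 2.
    System.out.println(Arrays.toString(search(scores, 2, 4))); // prints [1, 3]
  }
}
```
   
   With such a split, each segment (or shard) would merge only `k` hits, even though the search internally examined `numCandidates` entries.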




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


