[GitHub] [lucene] jtibshirani commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

GitBox Mon, 01 Nov 2021 21:45:27 -0700


jtibshirani commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r740706179




##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -99,32 +121,72 @@ public static NeighborQueue search(
       Bits acceptOrds,
       SplittableRandom random)

Review comment:
       I think this `SplittableRandom` is unused.

##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -99,32 +121,72 @@ public static NeighborQueue search(
       Bits acceptOrds,
       SplittableRandom random)
       throws IOException {
+
     int size = graphValues.size();
+    int boundedNumSeed = Math.max(topK, Math.min(numSeed, 2 * size));
+    NeighborQueue results;
+
+    int[] eps = new int[] {graphValues.entryNode()};
+    for (int level = graphValues.numLevels() - 1; level >= 1; level--) {
+      results =
+          HnswGraph.searchLevel(

Review comment:
       Tiny comment, we can remove the `HnswGraph` qualifier.

##########
File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -205,6 +215,43 @@ private FieldEntry readField(DataInput input) throws 
IOException {
     return new FieldEntry(input, similarityFunction);
   }
 
+  private void fillGraphNodesAndOffsetsByLevel() throws IOException {
+    for (FieldEntry entry : fields.values()) {
+      IndexInput input =

Review comment:
       If I understand correctly, the graph index file is only used to populate 
these data structures on `FieldEntry` (the graph level information). It feels a 
bit surprising that `FieldEntry` is constructed across two different files 
(both the metadata and graph index files). It also means `FieldEntry` isn't 
immutable.
   
   hmm, I'm not sure what typically belongs in a metadata file. I assume it 
doesn't make sense to move the graph index information there, alongside the ord 
-> doc mapping. Maybe we could keep the file as-is but create a new class to 
hold the graph index, something like `GraphLevels` ?

##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -56,31 +59,50 @@
 public final class HnswGraph extends KnnGraphValues {
 
   private final int maxConn;
+  private int numLevels; // the current number of levels in the graph
+  private int entryNode; // the current graph entry node on the top level
 
-  // Each entry lists the top maxConn neighbors of a node. The nodes 
correspond to vectors added to
-  // HnswBuilder, and the
-  // node values are the ordinals of those vectors.
-  private final List<NeighborArray> graph;
+  // Nodes by level expressed as the level 0's nodes' ordinals.
+  // As level 0 contains all nodes, nodesByLevel.get(0) is null.
+  private final List<int[]> nodesByLevel;
+
+  // graph is a list of graph levels.
+  // Each level is represented as List<NeighborArray> – nodes' connections on 
this level.
+  // Each entry in the list has the top maxConn neighbors of a node. The nodes 
correspond to vectors
+  // added to HnswBuilder, and the node values are the ordinals of those 
vectors.
+  // Thus, on all levels, neighbors expressed as the level 0's nodes' ordinals.
+  private final List<List<NeighborArray>> graph;
 
   // KnnGraphValues iterator members
   private int upto;
   private NeighborArray cur;
 
-  HnswGraph(int maxConn) {
-    graph = new ArrayList<>();
-    // Typically with diversity criteria we see nodes not fully occupied; 
average fanout seems to be
-    // about 1/2 maxConn. There is some indexing time penalty for 
under-allocating, but saves RAM
-    graph.add(new NeighborArray(Math.max(32, maxConn / 4)));
+  HnswGraph(int maxConn, int levelOfFirstNode) {
     this.maxConn = maxConn;
+    this.numLevels = levelOfFirstNode + 1;
+    this.graph = new ArrayList<>(numLevels);
+    this.entryNode = 0;
+    for (int i = 0; i < numLevels; i++) {
+      graph.add(new ArrayList<>());
+      // Typically with diversity criteria we see nodes not fully occupied;
+      // average fanout seems to be about 1/2 maxConn.
+      // There is some indexing time penalty for under-allocating, but saves 
RAM
+      graph.get(i).add(new NeighborArray(Math.max(32, maxConn / 4)));

Review comment:
       I am a little confused why we are careful not to overallocate here, but 
in `addNode` we do `new NeighborArray(maxConn + 1)));`? I see you maintained 
the existing behavior, so not critical to address in this PR, I was just 
curious about it.

##########
File path: lucene/core/src/java/org/apache/lucene/index/KnnGraphValues.java
##########
@@ -32,25 +33,41 @@
   protected KnnGraphValues() {}
 
   /**
-   * Move the pointer to exactly {@code target}, the id of a node in the 
graph. After this method
+   * Move the pointer to exactly the given {@code level}'s {@code target}. 
After this method
    * returns, call {@link #nextNeighbor()} to return successive (ordered) 
connected node ordinals.
    *
-   * @param target must be a valid node in the graph, ie. &ge; 0 and &lt; 
{@link
+   * @param level level of the graph
+   * @param target ordinal of a node in the graph, must be &ge; 0 and &lt; 
{@link
    *     VectorValues#size()}.
    */
-  public abstract void seek(int target) throws IOException;
+  public abstract void seek(int level, int target) throws IOException;
 
   /** Returns the number of nodes in the graph */
   public abstract int size();
 
   /**
    * Iterates over the neighbor list. It is illegal to call this method after 
it returns
-   * NO_MORE_DOCS without calling {@link #seek(int)}, which resets the 
iterator.
+   * NO_MORE_DOCS without calling {@link #seek(int, int)}, which resets the 
iterator.
    *
    * @return a node ordinal in the graph, or NO_MORE_DOCS if the iteration is 
complete.
    */
   public abstract int nextNeighbor() throws IOException;
 
+  /** Returns the number of levels of the graph */
+  public abstract int numLevels() throws IOException;
+
+  /** Returns graph's entry point on the top level * */
+  public abstract int entryNode() throws IOException;
+
+  /**
+   * Get all nodes on a given level as node 0th ordinals
+   *
+   * @param level level for which to get all nodes
+   * @return an iterator over nodes where {@code nextDoc} returns a next node 
ordinal
+   */
+  // TODO: return a more suitable iterator over nodes than DocIdSetIterator

Review comment:
       hmm, I guess we never addressed this TODO. I don't have a good 
alternative suggestion, maybe others do.

##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -56,31 +59,50 @@
 public final class HnswGraph extends KnnGraphValues {
 
   private final int maxConn;
+  private int numLevels; // the current number of levels in the graph
+  private int entryNode; // the current graph entry node on the top level
 
-  // Each entry lists the top maxConn neighbors of a node. The nodes 
correspond to vectors added to
-  // HnswBuilder, and the
-  // node values are the ordinals of those vectors.
-  private final List<NeighborArray> graph;
+  // Nodes by level expressed as the level 0's nodes' ordinals.
+  // As level 0 contains all nodes, nodesByLevel.get(0) is null.
+  private final List<int[]> nodesByLevel;
+
+  // graph is a list of graph levels.
+  // Each level is represented as List<NeighborArray> – nodes' connections on 
this level.
+  // Each entry in the list has the top maxConn neighbors of a node. The nodes 
correspond to vectors
+  // added to HnswBuilder, and the node values are the ordinals of those 
vectors.
+  // Thus, on all levels, neighbors expressed as the level 0's nodes' ordinals.
+  private final List<List<NeighborArray>> graph;
 
   // KnnGraphValues iterator members
   private int upto;
   private NeighborArray cur;
 
-  HnswGraph(int maxConn) {
-    graph = new ArrayList<>();
-    // Typically with diversity criteria we see nodes not fully occupied; 
average fanout seems to be
-    // about 1/2 maxConn. There is some indexing time penalty for 
under-allocating, but saves RAM
-    graph.add(new NeighborArray(Math.max(32, maxConn / 4)));
+  HnswGraph(int maxConn, int levelOfFirstNode) {
     this.maxConn = maxConn;
+    this.numLevels = levelOfFirstNode + 1;
+    this.graph = new ArrayList<>(numLevels);
+    this.entryNode = 0;
+    for (int i = 0; i < numLevels; i++) {
+      graph.add(new ArrayList<>());
+      // Typically with diversity criteria we see nodes not fully occupied;
+      // average fanout seems to be about 1/2 maxConn.
+      // There is some indexing time penalty for under-allocating, but saves 
RAM
+      graph.get(i).add(new NeighborArray(Math.max(32, maxConn / 4)));
+    }
+
+    this.nodesByLevel = new ArrayList<>(numLevels);
+    nodesByLevel.add(null); // we don't need this for 0th level, as it 
contains all nodes
+    for (int l = 1; l < numLevels; l++) {
+      nodesByLevel.add(new int[] {0});
+    }
   }
 
   /**
-   * Searches for the nearest neighbors of a query vector.
+   * Searches HNSW graph for the nearest neighbors of a query vector.
    *
    * @param query search query vector
    * @param topK the number of nodes to be returned
-   * @param numSeed the size of the queue maintained while searching, and 
controls the number of
-   *     random entry points to sample
+   * @param numSeed the size of the queue maintained while searching

Review comment:
       Could we remove this parameter `numSeed`? It seems particularly 
out-of-place now that we don't sample random entry points.
   
   I think it'd be a nice simplification to only have one parameter `topK`. I 
can't think of a case where callers would set `numSeed` to be something 
different from `topK`.

##########
File path: 
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsFormat.java
##########
@@ -69,10 +99,12 @@
 
   static final String META_CODEC_NAME = "Lucene90HnswVectorsFormatMeta";
   static final String VECTOR_DATA_CODEC_NAME = "Lucene90HnswVectorsFormatData";
-  static final String VECTOR_INDEX_CODEC_NAME = 
"Lucene90HnswVectorsFormatIndex";
+  static final String GRAPH_INDEX_CODEC_NAME = 
"Lucene90HnswVectorsFormatGraphIndex";

Review comment:
       Tiny comment, I always have to check to remind myself what these files 
contain. Maybe names like "GraphLayers" and "GraphNeighbors" would be more 
descriptive?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on a change in pull request #416: LUCENE-10054 Make HnswGraph hierarchical

Reply via email to