jtibshirani commented on a change in pull request #416:
URL: https://github.com/apache/lucene/pull/416#discussion_r740706179
##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -99,32 +121,72 @@ public static NeighborQueue search(
Bits acceptOrds,
SplittableRandom random)
Review comment:
I think this `SplittableRandom` is unused.
##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -99,32 +121,72 @@ public static NeighborQueue search(
Bits acceptOrds,
SplittableRandom random)
throws IOException {
+
int size = graphValues.size();
+ int boundedNumSeed = Math.max(topK, Math.min(numSeed, 2 * size));
+ NeighborQueue results;
+
+ int[] eps = new int[] {graphValues.entryNode()};
+ for (int level = graphValues.numLevels() - 1; level >= 1; level--) {
+ results =
+ HnswGraph.searchLevel(
Review comment:
Tiny comment, we can remove the `HnswGraph` qualifier.
##########
File path:
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java
##########
@@ -205,6 +215,43 @@ private FieldEntry readField(DataInput input) throws
IOException {
return new FieldEntry(input, similarityFunction);
}
+ private void fillGraphNodesAndOffsetsByLevel() throws IOException {
+ for (FieldEntry entry : fields.values()) {
+ IndexInput input =
Review comment:
If I understand correctly, the graph index file is only used to populate
these data structures on `FieldEntry` (the graph level information). It feels a
bit surprising that `FieldEntry` is constructed across two different files
(both the metadata and graph index files). It also means `FieldEntry` isn't
immutable.
hmm, I'm not sure what typically belongs in a metadata file. I assume it
doesn't make sense to move the graph index information there, alongside the ord
-> doc mapping. Maybe we could keep the file as-is but create a new class to
hold the graph index, something like `GraphLevels` ?
##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -56,31 +59,50 @@
public final class HnswGraph extends KnnGraphValues {
private final int maxConn;
+ private int numLevels; // the current number of levels in the graph
+ private int entryNode; // the current graph entry node on the top level
- // Each entry lists the top maxConn neighbors of a node. The nodes
correspond to vectors added to
- // HnswBuilder, and the
- // node values are the ordinals of those vectors.
- private final List<NeighborArray> graph;
+ // Nodes by level expressed as the level 0's nodes' ordinals.
+ // As level 0 contains all nodes, nodesByLevel.get(0) is null.
+ private final List<int[]> nodesByLevel;
+
+ // graph is a list of graph levels.
+ // Each level is represented as List<NeighborArray> – nodes' connections on
this level.
+ // Each entry in the list has the top maxConn neighbors of a node. The nodes
correspond to vectors
+ // added to HnswBuilder, and the node values are the ordinals of those
vectors.
+ // Thus, on all levels, neighbors expressed as the level 0's nodes' ordinals.
+ private final List<List<NeighborArray>> graph;
// KnnGraphValues iterator members
private int upto;
private NeighborArray cur;
- HnswGraph(int maxConn) {
- graph = new ArrayList<>();
- // Typically with diversity criteria we see nodes not fully occupied;
average fanout seems to be
- // about 1/2 maxConn. There is some indexing time penalty for
under-allocating, but saves RAM
- graph.add(new NeighborArray(Math.max(32, maxConn / 4)));
+ HnswGraph(int maxConn, int levelOfFirstNode) {
this.maxConn = maxConn;
+ this.numLevels = levelOfFirstNode + 1;
+ this.graph = new ArrayList<>(numLevels);
+ this.entryNode = 0;
+ for (int i = 0; i < numLevels; i++) {
+ graph.add(new ArrayList<>());
+ // Typically with diversity criteria we see nodes not fully occupied;
+ // average fanout seems to be about 1/2 maxConn.
+ // There is some indexing time penalty for under-allocating, but saves
RAM
+ graph.get(i).add(new NeighborArray(Math.max(32, maxConn / 4)));
Review comment:
I am a little confused why we are careful not to overallocate here, but
in `addNode` we do `new NeighborArray(maxConn + 1)));`? I see you maintained
the existing behavior, so not critical to address in this PR, I was just
curious about it.
##########
File path: lucene/core/src/java/org/apache/lucene/index/KnnGraphValues.java
##########
@@ -32,25 +33,41 @@
protected KnnGraphValues() {}
/**
- * Move the pointer to exactly {@code target}, the id of a node in the
graph. After this method
+ * Move the pointer to exactly the given {@code level}'s {@code target}.
After this method
* returns, call {@link #nextNeighbor()} to return successive (ordered)
connected node ordinals.
*
- * @param target must be a valid node in the graph, ie. ≥ 0 and <
{@link
+ * @param level level of the graph
+ * @param target ordinal of a node in the graph, must be ≥ 0 and <
{@link
* VectorValues#size()}.
*/
- public abstract void seek(int target) throws IOException;
+ public abstract void seek(int level, int target) throws IOException;
/** Returns the number of nodes in the graph */
public abstract int size();
/**
* Iterates over the neighbor list. It is illegal to call this method after
it returns
- * NO_MORE_DOCS without calling {@link #seek(int)}, which resets the
iterator.
+ * NO_MORE_DOCS without calling {@link #seek(int, int)}, which resets the
iterator.
*
* @return a node ordinal in the graph, or NO_MORE_DOCS if the iteration is
complete.
*/
public abstract int nextNeighbor() throws IOException;
+ /** Returns the number of levels of the graph */
+ public abstract int numLevels() throws IOException;
+
+ /** Returns graph's entry point on the top level * */
+ public abstract int entryNode() throws IOException;
+
+ /**
+ * Get all nodes on a given level as node 0th ordinals
+ *
+ * @param level level for which to get all nodes
+ * @return an iterator over nodes where {@code nextDoc} returns a next node
ordinal
+ */
+ // TODO: return a more suitable iterator over nodes than DocIdSetIterator
Review comment:
hmm, I guess we never addressed this TODO. I don't have a good
alternative suggestion, maybe others do.
##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
##########
@@ -56,31 +59,50 @@
public final class HnswGraph extends KnnGraphValues {
private final int maxConn;
+ private int numLevels; // the current number of levels in the graph
+ private int entryNode; // the current graph entry node on the top level
- // Each entry lists the top maxConn neighbors of a node. The nodes
correspond to vectors added to
- // HnswBuilder, and the
- // node values are the ordinals of those vectors.
- private final List<NeighborArray> graph;
+ // Nodes by level expressed as the level 0's nodes' ordinals.
+ // As level 0 contains all nodes, nodesByLevel.get(0) is null.
+ private final List<int[]> nodesByLevel;
+
+ // graph is a list of graph levels.
+ // Each level is represented as List<NeighborArray> – nodes' connections on
this level.
+ // Each entry in the list has the top maxConn neighbors of a node. The nodes
correspond to vectors
+ // added to HnswBuilder, and the node values are the ordinals of those
vectors.
+ // Thus, on all levels, neighbors expressed as the level 0's nodes' ordinals.
+ private final List<List<NeighborArray>> graph;
// KnnGraphValues iterator members
private int upto;
private NeighborArray cur;
- HnswGraph(int maxConn) {
- graph = new ArrayList<>();
- // Typically with diversity criteria we see nodes not fully occupied;
average fanout seems to be
- // about 1/2 maxConn. There is some indexing time penalty for
under-allocating, but saves RAM
- graph.add(new NeighborArray(Math.max(32, maxConn / 4)));
+ HnswGraph(int maxConn, int levelOfFirstNode) {
this.maxConn = maxConn;
+ this.numLevels = levelOfFirstNode + 1;
+ this.graph = new ArrayList<>(numLevels);
+ this.entryNode = 0;
+ for (int i = 0; i < numLevels; i++) {
+ graph.add(new ArrayList<>());
+ // Typically with diversity criteria we see nodes not fully occupied;
+ // average fanout seems to be about 1/2 maxConn.
+ // There is some indexing time penalty for under-allocating, but saves
RAM
+ graph.get(i).add(new NeighborArray(Math.max(32, maxConn / 4)));
+ }
+
+ this.nodesByLevel = new ArrayList<>(numLevels);
+ nodesByLevel.add(null); // we don't need this for 0th level, as it
contains all nodes
+ for (int l = 1; l < numLevels; l++) {
+ nodesByLevel.add(new int[] {0});
+ }
}
/**
- * Searches for the nearest neighbors of a query vector.
+ * Searches HNSW graph for the nearest neighbors of a query vector.
*
* @param query search query vector
* @param topK the number of nodes to be returned
- * @param numSeed the size of the queue maintained while searching, and
controls the number of
- * random entry points to sample
+ * @param numSeed the size of the queue maintained while searching
Review comment:
Could we remove this parameter `numSeed`? It seems particularly
out-of-place now that we don't sample random entry points.
I think it'd be a nice simplification to only have one parameter `topK`. I
can't think of a case where callers would set `numSeed` to be something
different from `topK`.
##########
File path:
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsFormat.java
##########
@@ -69,10 +99,12 @@
static final String META_CODEC_NAME = "Lucene90HnswVectorsFormatMeta";
static final String VECTOR_DATA_CODEC_NAME = "Lucene90HnswVectorsFormatData";
- static final String VECTOR_INDEX_CODEC_NAME =
"Lucene90HnswVectorsFormatIndex";
+ static final String GRAPH_INDEX_CODEC_NAME =
"Lucene90HnswVectorsFormatGraphIndex";
Review comment:
Tiny comment, I always have to check to remind myself what these files
contain. Maybe names like "GraphLayers" and "GraphNeighbors" would be more
descriptive?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]