jimczi commented on code in PR #13288:
URL: https://github.com/apache/lucene/pull/13288#discussion_r1560095226


##########
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatVectorsWriter.java:
##########
@@ -71,7 +72,9 @@ public final class Lucene99FlatVectorsWriter extends 
FlatVectorsWriter {
   private final List<FieldWriter<?>> fields = new ArrayList<>();
   private boolean finished;
 
-  public Lucene99FlatVectorsWriter(SegmentWriteState state) throws IOException 
{
+  public Lucene99FlatVectorsWriter(SegmentWriteState state, FlatVectorsScorer 
scorer)
+      throws IOException {
+    super(scorer);

Review Comment:
   Shouldn't we use the protected `vectorScorers` in 
[`mergeOneFieldToIndex`](https://github.com/apache/lucene/pull/13288/files#diff-f034f4648651d6e214af2bc3af71eeaae15a57d18b34783efad3eb48673206ddR254)?
 



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene95/OffHeapRandomAccessVectorValues.java:
##########
@@ -0,0 +1,34 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene95;
+
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.util.hnsw.RandomAccessVectorValues;
+
+/**
+ * Provides random access to vectors off-heap by the vector ordinal.
+ *
+ * @lucene.experimental
+ */
+public interface OffHeapRandomAccessVectorValues extends 
RandomAccessVectorValues {

Review Comment:
   nit: I am not a big fan of this off/on heap nomenclature. In this particular 
context I think a name like `IndexInputRandomAccessVectorValues` would be less 
confusing. Another option to avoid adding another interface would be to add the 
`getSlice` directly in `RandomAccessVectorValues` and to return null in case 
the index input is not accessible? 



##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/bitvectors/HnswBitVectorsFormat.java:
##########
@@ -0,0 +1,211 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.bitvectors;
+
+import static 
org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat.DEFAULT_BEAM_WIDTH;
+import static 
org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat.DEFAULT_MAX_CONN;
+import static 
org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat.DEFAULT_NUM_MERGE_WORKER;
+import static 
org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat.MAXIMUM_BEAM_WIDTH;
+import static 
org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat.MAXIMUM_MAX_CONN;
+
+import java.io.IOException;
+import java.util.concurrent.ExecutorService;
+import org.apache.lucene.codecs.KnnFieldVectorsWriter;
+import org.apache.lucene.codecs.KnnVectorsFormat;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.hnsw.FlatFieldVectorsWriter;
+import org.apache.lucene.codecs.hnsw.FlatVectorsFormat;
+import org.apache.lucene.codecs.hnsw.FlatVectorsWriter;
+import org.apache.lucene.codecs.lucene99.Lucene99FlatVectorsFormat;
+import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat;
+import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader;
+import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsWriter;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.MergeState;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.index.Sorter;
+import org.apache.lucene.index.VectorEncoding;
+import org.apache.lucene.search.TaskExecutor;
+import org.apache.lucene.util.hnsw.CloseableRandomVectorScorerSupplier;
+import org.apache.lucene.util.hnsw.HnswGraph;
+
+/**
+ * Encodes bit vector values into an associated graph connecting the documents 
having values. The
+ * graph is used to power HNSW search. The format consists of two files, and 
uses {@link
+ * Lucene99FlatVectorsFormat} to store the actual vectors, but with a custom 
scorer implementation:
+ * For details on graph storage and file extensions, see {@link 
Lucene99HnswVectorsFormat}.
+ *
+ * @lucene.experimental
+ */
+public final class HnswBitVectorsFormat extends KnnVectorsFormat {

Review Comment:
   We can maybe start with a simpler `FlatBitVectorsFormat`?



##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/bitvectors/OnHeapFlatBitVectorsScorer.java:
##########
@@ -0,0 +1,122 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.bitvectors;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.hnsw.FlatVectorsScorer;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.VectorUtil;
+import org.apache.lucene.util.hnsw.RandomAccessVectorValues;
+import org.apache.lucene.util.hnsw.RandomVectorScorer;
+import org.apache.lucene.util.hnsw.RandomVectorScorerSupplier;
+
+/** A bit vector scorer for scoring byte vectors. */
+public class OnHeapFlatBitVectorsScorer implements FlatVectorsScorer {

Review Comment:
   nit: On heap is misleading here, `FlatBitVectorsScorer` should be enough?



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsWriter.java:
##########
@@ -556,14 +556,15 @@ static FieldWriter<?> create(FieldInfo fieldInfo, int M, 
int beamWidth, InfoStre
       this.fieldInfo = fieldInfo;
       this.docsWithField = new DocsWithFieldSet();
       vectors = new ArrayList<>();
-      RAVectorValues<T> raVectors = new RAVectorValues<>(vectors, 
fieldInfo.getVectorDimension());
       RandomVectorScorerSupplier scorerSupplier =
           switch (fieldInfo.getVectorEncoding()) {
             case BYTE -> RandomVectorScorerSupplier.createBytes(

Review Comment:
   Should it use the `flatVectorWriter#vectorsScorer ` to create these 
suppliers? 



##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/bitvectors/HnswBitVectorsFormat.java:
##########
@@ -0,0 +1,211 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.codecs.bitvectors;
+
+import static 
org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat.DEFAULT_BEAM_WIDTH;
+import static 
org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat.DEFAULT_MAX_CONN;
+import static 
org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat.DEFAULT_NUM_MERGE_WORKER;
+import static 
org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat.MAXIMUM_BEAM_WIDTH;
+import static 
org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat.MAXIMUM_MAX_CONN;
+
+import java.io.IOException;
+import java.util.concurrent.ExecutorService;
+import org.apache.lucene.codecs.KnnFieldVectorsWriter;
+import org.apache.lucene.codecs.KnnVectorsFormat;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.codecs.KnnVectorsWriter;
+import org.apache.lucene.codecs.hnsw.FlatFieldVectorsWriter;
+import org.apache.lucene.codecs.hnsw.FlatVectorsFormat;
+import org.apache.lucene.codecs.hnsw.FlatVectorsWriter;
+import org.apache.lucene.codecs.lucene99.Lucene99FlatVectorsFormat;
+import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat;
+import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader;
+import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsWriter;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.MergeState;
+import org.apache.lucene.index.SegmentReadState;
+import org.apache.lucene.index.SegmentWriteState;
+import org.apache.lucene.index.Sorter;
+import org.apache.lucene.index.VectorEncoding;
+import org.apache.lucene.search.TaskExecutor;
+import org.apache.lucene.util.hnsw.CloseableRandomVectorScorerSupplier;
+import org.apache.lucene.util.hnsw.HnswGraph;
+
+/**
+ * Encodes bit vector values into an associated graph connecting the documents 
having values. The
+ * graph is used to power HNSW search. The format consists of two files, and 
uses {@link
+ * Lucene99FlatVectorsFormat} to store the actual vectors, but with a custom 
scorer implementation:
+ * For details on graph storage and file extensions, see {@link 
Lucene99HnswVectorsFormat}.
+ *
+ * @lucene.experimental
+ */
+public final class HnswBitVectorsFormat extends KnnVectorsFormat {
+
+  /**
+   * Controls how many of the nearest neighbor candidates are connected to the 
new node. Defaults to
+   * {@link Lucene99HnswVectorsFormat#DEFAULT_MAX_CONN}. See {@link HnswGraph} 
for more details.
+   */
+  private final int maxConn;
+
+  /**
+   * The number of candidate neighbors to track while searching the graph for 
each newly inserted
+   * node. Defaults to {@link Lucene99HnswVectorsFormat#DEFAULT_BEAM_WIDTH}. 
See {@link HnswGraph}
+   * for details.
+   */
+  private final int beamWidth;
+
+  /** The format for storing, reading, merging vectors on disk */
+  private final FlatVectorsFormat flatVectorsFormat;
+
+  private final int numMergeWorkers;
+  private final TaskExecutor mergeExec;
+
+  /** Constructs a format using default graph construction parameters */
+  public HnswBitVectorsFormat() {
+    this(DEFAULT_MAX_CONN, DEFAULT_BEAM_WIDTH, DEFAULT_NUM_MERGE_WORKER, null);
+  }
+
+  /**
+   * Constructs a format using the given graph construction parameters.
+   *
+   * @param maxConn the maximum number of connections to a node in the HNSW 
graph
+   * @param beamWidth the size of the queue maintained during graph 
construction.
+   */
+  public HnswBitVectorsFormat(int maxConn, int beamWidth) {
+    this(maxConn, beamWidth, DEFAULT_NUM_MERGE_WORKER, null);
+  }
+
+  /**
+   * Constructs a format using the given graph construction parameters and 
scalar quantization.
+   *
+   * @param maxConn the maximum number of connections to a node in the HNSW 
graph
+   * @param beamWidth the size of the queue maintained during graph 
construction.
+   * @param numMergeWorkers number of workers (threads) that will be used when 
doing merge. If
+   *     larger than 1, a non-null {@link ExecutorService} must be passed as 
mergeExec
+   * @param mergeExec the {@link ExecutorService} that will be used by ALL 
vector writers that are
+   *     generated by this format to do the merge
+   */
+  public HnswBitVectorsFormat(
+      int maxConn, int beamWidth, int numMergeWorkers, ExecutorService 
mergeExec) {
+    super("Lucene99HnswBitVectorsFormat");
+    if (maxConn <= 0 || maxConn > MAXIMUM_MAX_CONN) {
+      throw new IllegalArgumentException(
+          "maxConn must be positive and less than or equal to "
+              + MAXIMUM_MAX_CONN
+              + "; maxConn="
+              + maxConn);
+    }
+    if (beamWidth <= 0 || beamWidth > MAXIMUM_BEAM_WIDTH) {
+      throw new IllegalArgumentException(
+          "beamWidth must be positive and less than or equal to "
+              + MAXIMUM_BEAM_WIDTH
+              + "; beamWidth="
+              + beamWidth);
+    }
+    this.maxConn = maxConn;
+    this.beamWidth = beamWidth;
+    if (numMergeWorkers == 1 && mergeExec != null) {
+      throw new IllegalArgumentException(
+          "No executor service is needed as we'll use single thread to merge");
+    }
+    this.numMergeWorkers = numMergeWorkers;
+    if (mergeExec != null) {
+      this.mergeExec = new TaskExecutor(mergeExec);
+    } else {
+      this.mergeExec = null;
+    }
+    this.flatVectorsFormat = new Lucene99FlatVectorsFormat(new 
OnHeapFlatBitVectorsScorer());

Review Comment:
   I think we can use the `Lucene99HnswVectorsFormat` directly if we expose the 
vector scorer the same way than the `Lucene99FlatVectorsFormat`? This would 
avoid the copy and the need to wrap the reader/writer?



##########
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene92/OffHeapFloatVectorValues.java:
##########
@@ -28,7 +28,7 @@
 
 /** Read the vector values from the index input. This supports both iterated 
and random access. */
 abstract class OffHeapFloatVectorValues extends FloatVectorValues
-    implements RandomAccessVectorValues<float[]> {
+    implements RandomAccessVectorValues.Floats {

Review Comment:
   Not for this PR but I would like to try splitting `FloatVectorValues` and 
`RandomAccessVectorValues.Floats`. Having a single hierarchy that mixes the 
access pattern is not ideal. With the `FlatVectorFomat` in the mix we should be 
able to produce `RandomAccessVectorValues` and `FloatVectorValues` 
independently. This change should help this simplification :)



##########
lucene/core/src/java/org/apache/lucene/codecs/hnsw/OnHeapFlatVectorScorer.java:
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.codecs.hnsw;
+
+import java.io.IOException;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.util.hnsw.RandomAccessVectorValues;
+import org.apache.lucene.util.hnsw.RandomVectorScorer;
+import org.apache.lucene.util.hnsw.RandomVectorScorerSupplier;
+
+/** On-heap implementation of {@link FlatVectorsScorer}. */
+public class OnHeapFlatVectorScorer implements FlatVectorsScorer {

Review Comment:
   The on-heap part is not really needed, maybe `DefaultFlatVectorScorer`? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to