[GitHub] [lucene] jpountz commented on a diff in pull request #12434: Add ParentJoin KNN support

via GitHub Wed, 26 Jul 2023 08:33:02 -0700


jpountz commented on code in PR #12434:
URL: https://github.com/apache/lucene/pull/12434#discussion_r1275147272



##########
lucene/core/src/java/org/apache/lucene/search/KnnFloatVectorQuery.java:
##########
@@ -78,8 +79,16 @@ public KnnFloatVectorQuery(String field, float[] target, int 
k, Query filter) {
   @Override
   protected TopDocs approximateSearch(LeafReaderContext context, Bits 
acceptDocs, int visitedLimit)
       throws IOException {
+    FieldInfo fi = context.reader().getFieldInfos().fieldInfo(field);
+    if (fi == null || fi.getVectorDimension() == 0) {
+      // The field does not exist or does not index vectors
+      return NO_RESULTS;
+    }
+    int k = Math.min(this.k, 
context.reader().getFloatVectorValues(fi.name).size());
     TopDocs results =
-        context.reader().searchNearestVectors(field, target, k, acceptDocs, 
visitedLimit);
+        context
+            .reader()
+            .searchNearestVectors(field, target, new TopKnnCollector(k, 
visitedLimit), acceptDocs);

Review Comment:
   like on the byte vector query, we could push this logic of checking for 
field existence and reducing `k` to `LeafReader`?



##########
lucene/core/src/java/org/apache/lucene/search/KnnCollector.java:
##########
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+/**
+ * KnnCollector is a knn collector used for gathering kNN results and 
providing topDocs from the
+ * gathered neighbors
+ */
+public interface KnnCollector {
+
+  /**
+   * If search visits too many documents, the results collector will terminate 
early. Usually, this
+   * is due to some restricted filter on the document set.
+   *
+   * <p>When collection is earlyTerminated, the results are not a correct 
representation of k
+   * nearest neighbors.
+   *
+   * @return is the current result set marked as incomplete?
+   */
+  boolean earlyTerminated();
+
+  /**
+   * @param count increments the visited vector count, must be greater than 0.
+   */
+  void incVisitedCount(int count);
+
+  /**
+   * @return the current visited vector count
+   */
+  long visitedCount();
+
+  /**
+   * @return the visited vector limit
+   */
+  long visitLimit();
+
+  /**
+   * @return the expected number of collected results
+   */
+  int k();
+
+  /**
+   * Collect the provided docId and include in the result set.
+   *
+   * @param docId of the vector to collect
+   * @param similarity its calculated similarity
+   * @return true if the vector is collected
+   */
+  boolean collect(int docId, float similarity);
+
+  /**
+   * This method is utilized during search to ensure only competitive results 
are explored.
+   *
+   * <p>Consequently, if this results collector wants to collect `k` results, 
this should return
+   * {@link Float#NEGATIVE_INFINITY} when not full.
+   *
+   * <p>When full, the minimum score should be returned.
+   *
+   * @return the current minimum competitive similarity in the collection
+   */
+  float minCompetitiveSimilarity();
+
+  /**
+   * This drains the collected nearest kNN results and returns them in a new 
{@link TopDocs}
+   * collection, ordered by score descending

Review Comment:
   "drains" suggests it already, but for extra clarity, maybe mention that this 
is a destructive operation



##########
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java:
##########
@@ -79,14 +80,13 @@ protected KnnVectorsReader() {}
    *
    * @param field the vector field to search
    * @param target the vector-valued query
-   * @param k the number of docs to return
+   * @param knnResults a KnnResults collector and relevant settings for 
gathering vector results
    * @param acceptDocs {@link Bits} that represents the allowed documents to 
match, or {@code null}
    *     if they are all allowed to match.
-   * @param visitedLimit the maximum number of nodes that the search is 
allowed to visit
    * @return the k nearest neighbor documents, along with their 
(similarity-specific) scores.
    */
   public abstract TopDocs search(
-      String field, float[] target, int k, Bits acceptDocs, int visitedLimit) 
throws IOException;
+      String field, float[] target, KnnResults knnResults, Bits acceptDocs) 
throws IOException;

Review Comment:
   Not a strong opinion, but when there is a destructive operation like 
`KnnResults#topDocs`, I prefer that we make the same component responsible for 
creating the object and calling the destructive method, while here, the user 
would be responsible for creating the `KnnResults` object and then the codec is 
responsible for calling `KnnResults#topDocs`. If it returned void, the caller 
would be responsible for both creating the object and calling the destructive 
method.



##########
lucene/core/src/java/org/apache/lucene/util/hnsw/TopKnnCollector.java:
##########
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.hnsw;
+
+import org.apache.lucene.search.AbstractKnnCollector;
+import org.apache.lucene.search.ScoreDoc;
+import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.search.TotalHits;
+
+/**
+ * TopKnnCollector is a specific KnnResults. A minHeap is used to keep track 
of the currently

Review Comment:
   ```suggestion
    * TopKnnCollector is a specific KnnCollector. A minHeap is used to keep 
track of the currently
   ```



##########
lucene/core/src/java/org/apache/lucene/util/hnsw/TopKnnCollector.java:
##########
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.hnsw;

Review Comment:
   It should be in oal.search I believe as it's not specific to our hnsw 
implementation?



##########
lucene/core/src/java/org/apache/lucene/search/KnnByteVectorQuery.java:
##########
@@ -77,8 +78,16 @@ public KnnByteVectorQuery(String field, byte[] target, int 
k, Query filter) {
   @Override
   protected TopDocs approximateSearch(LeafReaderContext context, Bits 
acceptDocs, int visitedLimit)
       throws IOException {
+    FieldInfo fi = context.reader().getFieldInfos().fieldInfo(field);
+    if (fi == null || fi.getVectorDimension() == 0) {
+      // The field does not exist or does not index vectors
+      return NO_RESULTS;
+    }
+    int k = Math.min(this.k, 
context.reader().getByteVectorValues(fi.name).size());

Review Comment:
   Should we have this logic in the  `LeafReader#searchNearestVectors` that 
takes an `int k`, and call it here instead of the expert method that takes a 
`KnnCollector`?



##########
lucene/core/src/java/org/apache/lucene/search/AbstractKnnCollector.java:
##########
@@ -0,0 +1,113 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+/**
+ * AbstractKnnCollector is the default implementation for a knn collector used 
for gathering kNN
+ * results and providing topDocs from the gathered neighbors
+ */
+public abstract class AbstractKnnCollector implements KnnCollector {
+
+  private long visitedCount;
+  private final long visitLimit;
+  private final int k;
+
+  protected AbstractKnnCollector(int k, long visitLimit) {
+    this.visitLimit = visitLimit;
+    this.k = k;
+  }
+
+  /**
+   * If search visits too many documents, the results collector will terminate 
early. Usually, this
+   * is due to some restricted filter on the document set.
+   *
+   * <p>When collection is earlyTerminated, the results are not a correct 
representation of k
+   * nearest neighbors.
+   *
+   * @return is the current result set marked as incomplete?
+   */
+  @Override
+  public final boolean earlyTerminated() {
+    return visitedCount >= visitLimit;
+  }
+
+  /**
+   * @param count increments the visited vector count, must be greater than 0.
+   */
+  @Override
+  public final void incVisitedCount(int count) {
+    assert count > 0;
+    this.visitedCount += count;
+  }
+
+  /**
+   * @return the current visited vector count
+   */
+  @Override
+  public final long visitedCount() {
+    return visitedCount;
+  }
+
+  /**
+   * @return the visited vector limit
+   */
+  @Override
+  public final long visitLimit() {
+    return visitLimit;
+  }
+
+  /**
+   * @return the expected number of collected results
+   */
+  @Override
+  public final int k() {
+    return k;
+  }
+
+  /**
+   * Collect the provided docId and include in the result set.
+   *
+   * @param docId of the vector to collect
+   * @param similarity its calculated similarity
+   * @return true if the vector is collected
+   */
+  @Override
+  public abstract boolean collect(int docId, float similarity);
+
+  /**
+   * This method is utilized during search to ensure only competitive results 
are explored.
+   *
+   * <p>Consequently, if this results collector wants to collect `k` results, 
this should return
+   * {@link Float#NEGATIVE_INFINITY} when not full.
+   *
+   * <p>When full, the minimum score should be returned.
+   *
+   * @return the current minimum competitive similarity in the collection
+   */
+  @Override
+  public abstract float minCompetitiveSimilarity();
+
+  /**
+   * This drains the collected nearest kNN results and returns them in a new 
{@link TopDocs}
+   * collection, ordered by score descending
+   *
+   * @return The collected top documents
+   */

Review Comment:
   We don't need these docs here as they're already on the interface?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] jpountz commented on a diff in pull request #12434: Add ParentJoin KNN support

Reply via email to