Re: [PR] Speedup concurrent multi-segment HNSW graph search 2 [lucene]

via GitHub Thu, 01 Feb 2024 11:54:04 -0800


benwtrent commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1475057976



##########
lucene/core/src/java/org/apache/lucene/search/knn/KnnCollectorManager.java:
##########
@@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search.knn;
+
+import java.io.IOException;
+import org.apache.lucene.search.KnnCollector;
+import org.apache.lucene.util.BitSet;
+
+/**
+ * KnnCollectorManager responsible for creating {@link KnnCollector} instances. Useful to create
+ * {@link KnnCollector} instances that share global state across leaves, such a global queue of
+ * results collected so far.
+ */
+public abstract class KnnCollectorManager<C extends KnnCollector> {

Review Comment:
   Do we need these generics? This also seems like it should be an `interface`
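
A minimal sketch of the non-generic `interface` shape this comment hints at. `KnnCollector` below is a stand-in declared only so the snippet compiles on its own, not the real Lucene type, and the single-argument `newCollector` is a simplification of the signature in the diff (no parent bitset, no `IOException`).

```java
// Stand-in for org.apache.lucene.search.KnnCollector so the sketch compiles alone.
interface KnnCollector {
  int k();
}

// Hypothetical non-generic shape: callers only ever see KnnCollector,
// so a type parameter would add nothing at the call site.
interface KnnCollectorManager {
  KnnCollector newCollector(int visitedLimit);
}

final class InterfaceSketch {
  // With a single abstract method, a lambda is enough to implement the manager.
  // The inner lambda implements the stand-in KnnCollector and reports k.
  static KnnCollectorManager topK(int k) {
    return visitedLimit -> () -> k;
  }
}
```

Whether dropping the generics is acceptable depends on whether any caller needs the concrete collector type back, which this sketch does not attempt to answer.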



##########
lucene/join/src/java/org/apache/lucene/search/join/DiversifyingChildrenByteKnnVectorQuery.java:
##########
@@ -123,7 +124,16 @@ protected TopDocs exactSearch(LeafReaderContext context, DocIdSetIterator accept
   }
 
   @Override
-  protected TopDocs approximateSearch(LeafReaderContext context, Bits acceptDocs, int visitedLimit)
+  protected KnnCollectorManager<?> getKnnCollectorManager(int k, boolean supportsConcurrency) {
+    return new DiversifyingNearestChildrenKnnCollectorManager(k);

Review Comment:
   If we adjust the interface, this manager could know about `BitSetProducer parentsFilter;` and abstract that away from this query.



##########
lucene/core/src/java/org/apache/lucene/index/LeafReader.java:
##########
@@ -277,27 +273,24 @@ public final TopDocs searchNearestVectors(
    *
    * @param field the vector field to search
    * @param target the vector-valued query
-   * @param k the number of docs to return
    * @param acceptDocs {@link Bits} that represents the allowed documents to match, or {@code null}
    *     if they are all allowed to match.
-   * @param visitedLimit the maximum number of nodes that the search is allowed to visit
+   * @param knnCollector collector with settings for gathering the vector results.
    * @return the k nearest neighbor documents, along with their (searchStrategy-specific) scores.
    * @lucene.experimental
    */
   public final TopDocs searchNearestVectors(
-      String field, byte[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException {

Review Comment:
   Same here, this signature shouldn't be changed at all.



##########
lucene/core/src/java/org/apache/lucene/search/AbstractKnnCollector.java:
##########
@@ -23,7 +23,7 @@
  */
 public abstract class AbstractKnnCollector implements KnnCollector {
 
-  private long visitedCount;
+  long visitedCount;

Review Comment:
   I think this should be protected, not package-private. Only subclasses should be able to read it.



##########
lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:
##########
@@ -27,25 +29,78 @@
  */
 public final class TopKnnCollector extends AbstractKnnCollector {
 
+  // greediness of globally non-competitive search: (0,1]
+  private static final float DEFAULT_GREEDINESS = 0.9f;
+  // the local queue of the results with the highest similarities collected so far in the current
+  // segment

Review Comment:
   I think this should be a separate collector, something like `MultiLeafTopKnnCollector`.
   
   There is so little code from the original collector still around that it seems weird to me. We should have two collectors: one that shares information and one that doesn't.
   
   This allows us to remove all the `null` values in the ctor.
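
A rough sketch of the two-collector split described above, under heavy assumptions: `SimpleKnnCollector`, `LocalTopKCollector`, and `SharedTopK` are hypothetical stand-ins rather than Lucene classes, only similarities are tracked (doc IDs are ignored), `k` is assumed positive, and the greediness blending from the PR is left out. The point is only that the sharing collector can require its shared state, so no constructor argument is ever `null`.

```java
import java.util.Objects;
import java.util.PriorityQueue;

// Hypothetical minimal contract, standing in for Lucene's KnnCollector.
interface SimpleKnnCollector {
  boolean collect(int docId, float similarity);

  float minCompetitiveSimilarity();
}

// Collector with no shared state: keeps the k highest similarities of one leaf.
final class LocalTopKCollector implements SimpleKnnCollector {
  private final int k; // assumed > 0
  private final PriorityQueue<Float> queue = new PriorityQueue<>(); // min-heap

  LocalTopKCollector(int k) {
    this.k = k;
  }

  @Override
  public boolean collect(int docId, float similarity) {
    if (queue.size() < k) {
      queue.add(similarity);
      return true;
    }
    if (similarity > queue.peek()) {
      queue.poll(); // evict the current worst entry
      queue.add(similarity);
      return true;
    }
    return false;
  }

  @Override
  public float minCompetitiveSimilarity() {
    return queue.size() < k ? Float.NEGATIVE_INFINITY : queue.peek();
  }
}

// State shared by all leaves; a plain lock keeps the sketch simple.
final class SharedTopK {
  private final LocalTopKCollector global;

  SharedTopK(int k) {
    this.global = new LocalTopKCollector(k);
  }

  synchronized void offer(int docId, float similarity) {
    global.collect(docId, similarity);
  }

  synchronized float min() {
    return global.minCompetitiveSimilarity();
  }
}

// Collector that shares information: the shared state is mandatory, never null.
final class MultiLeafTopKCollector implements SimpleKnnCollector {
  private final LocalTopKCollector local;
  private final SharedTopK shared;

  MultiLeafTopKCollector(int k, SharedTopK shared) {
    this.local = new LocalTopKCollector(k);
    this.shared = Objects.requireNonNull(shared);
  }

  @Override
  public boolean collect(int docId, float similarity) {
    shared.offer(docId, similarity);
    return local.collect(docId, similarity);
  }

  @Override
  public float minCompetitiveSimilarity() {
    // The tighter of the local and global bounds prunes the search.
    return Math.max(local.minCompetitiveSimilarity(), shared.min());
  }
}
```

In this sketch a leaf that has collected nothing locally still inherits the global bound, which is the information sharing the PR is after.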



##########
lucene/core/src/java/org/apache/lucene/search/knn/KnnCollectorManager.java:
##########
@@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search.knn;
+
+import java.io.IOException;
+import org.apache.lucene.search.KnnCollector;
+import org.apache.lucene.util.BitSet;
+
+/**
+ * KnnCollectorManager responsible for creating {@link KnnCollector} instances. Useful to create
+ * {@link KnnCollector} instances that share global state across leaves, such a global queue of
+ * results collected so far.
+ */
+public abstract class KnnCollectorManager<C extends KnnCollector> {
+
+  /**
+   * Return a new {@link KnnCollector} instance.
+   *
+   * @param visitedLimit the maximum number of nodes that the search is allowed to visit
+   * @param parentBitSet the parent bitset, {@code null} if not applicable
+   */
+  public abstract C newCollector(int visitedLimit, BitSet parentBitSet) throws IOException;

Review Comment:
   ```suggestion
     public abstract C newCollector(int visitedLimit, LeafReaderContext context) throws IOException;
   ```
   
   Also, I am not even sure `visitedLimit` should be there. It seems like something the manager should already know about (in this instance it's static); we just need to know about the context. The context is there for `DiversifyingChildrenFloatKnnVectorQuery`, so that its collector manager can create `BitSet parentBitSet` from its encapsulated `BitSetProducer`.
   
   I also think this method could return `null` if collection is not applicable for the given leaf context.
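
One possible reading of this suggestion, sketched with stand-in types: `LeafContext`, `ChildCollector`, and `DiversifyingManagerSketch` are hypothetical names, `java.util.BitSet` replaces Lucene's `BitSet`, and a `Function` plays the role of `BitSetProducer`. The manager holds `k`, `visitedLimit`, and the parents filter, derives the parent bitset from the leaf context itself, and returns `null` when the leaf has nothing to collect.

```java
import java.util.BitSet;
import java.util.function.Function;

// Hypothetical stand-in for LeafReaderContext: only the leaf ordinal is modeled.
final class LeafContext {
  final int ord;

  LeafContext(int ord) {
    this.ord = ord;
  }
}

// Hypothetical stand-in for the diversifying collector.
final class ChildCollector {
  final int k;
  final int visitedLimit;
  final BitSet parents;

  ChildCollector(int k, int visitedLimit, BitSet parents) {
    this.k = k;
    this.visitedLimit = visitedLimit;
    this.parents = parents;
  }
}

final class DiversifyingManagerSketch {
  private final int k;
  private final int visitedLimit; // assumed fixed per query, as the comment suggests
  private final Function<LeafContext, BitSet> parentsFilter; // plays the BitSetProducer role

  DiversifyingManagerSketch(
      int k, int visitedLimit, Function<LeafContext, BitSet> parentsFilter) {
    this.k = k;
    this.visitedLimit = visitedLimit;
    this.parentsFilter = parentsFilter;
  }

  // Returns null when the leaf has no parent documents, i.e. nothing to collect.
  ChildCollector newCollector(LeafContext context) {
    BitSet parents = parentsFilter.apply(context);
    if (parents == null || parents.isEmpty()) {
      return null;
    }
    return new ChildCollector(k, visitedLimit, parents);
  }
}
```

A caller would then skip any leaf whose collector is `null` instead of resolving the parent bitset itself.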
   
   
   



##########
lucene/core/src/java/org/apache/lucene/index/LeafReader.java:
##########
@@ -236,27 +235,24 @@ public final PostingsEnum postings(Term term) throws IOException {
    *
    * @param field the vector field to search
    * @param target the vector-valued query
-   * @param k the number of docs to return
    * @param acceptDocs {@link Bits} that represents the allowed documents to match, or {@code null}
    *     if they are all allowed to match.
-   * @param visitedLimit the maximum number of nodes that the search is allowed to visit
+   * @param knnCollector collector with settings for gathering the vector results.
    * @return the k nearest neighbor documents, along with their (searchStrategy-specific) scores.
    * @lucene.experimental
    */
   public final TopDocs searchNearestVectors(
-      String field, float[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException {
+      String field, float[] target, Bits acceptDocs, KnnCollector knnCollector) throws IOException {

Review Comment:
   I agree with Jim, this should be `String field, float[] target, int k, Bits acceptDocs, int visitedLimit` at least for this PR, and not use a queue.



##########
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##########
@@ -79,24 +83,32 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException {
       filterWeight = null;
     }
 
+    final boolean supportsConcurrency = indexSearcher.getSlices().length > 1;

Review Comment:
   > because we see speedups even in sequential run
   
   Do you mean speedups without concurrency, just from sharing information? That is interesting; I wonder why that is.



##########
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##########
@@ -79,24 +83,32 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException {
       filterWeight = null;
     }
 
+    final boolean supportsConcurrency = indexSearcher.getSlices().length > 1;
+    KnnCollectorManager<?> knnCollectorManager = getKnnCollectorManager(k, supportsConcurrency);

Review Comment:
   I think this interface should accept the `indexSearcher` as the parameter, not `supportsConcurrency` or `multipleLeaves`.
   
   This way it can build whatever internal state it needs, which is particularly useful for `DiversifyingChildrenFloatKnnVectorQuery` etc.
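
A small sketch of the factory shape proposed here, assuming the manager decides for itself what to share once it can see the searcher. `SearcherInfo`, `ManagerSketch`, and `QuerySketch` are hypothetical stand-ins (the real parameter would be `IndexSearcher`), and using the leaf count as the criterion is an assumption of this sketch, not something the PR settles.

```java
// Hypothetical stand-in for IndexSearcher: only the number of leaves is modeled.
final class SearcherInfo {
  final int leafCount;

  SearcherInfo(int leafCount) {
    this.leafCount = leafCount;
  }
}

// Hypothetical manager contract, reduced to the one decision of interest.
interface ManagerSketch {
  boolean sharesState();
}

final class QuerySketch {
  // The query hands over the searcher; the manager derives its own flags
  // instead of receiving a precomputed boolean from the caller.
  static ManagerSketch getKnnCollectorManager(int k, SearcherInfo searcher) {
    boolean multipleLeaves = searcher.leafCount > 1;
    return () -> multipleLeaves;
  }
}
```

A subclass could read other searcher state in the same place without any change to the method signature.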



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

