Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

via GitHub Tue, 16 Jan 2024 03:54:51 -0800


tveasey commented on code in PR #12962:
URL: https://github.com/apache/lucene/pull/12962#discussion_r1453309333



##########
lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:
##########
@@ -27,25 +29,72 @@
  */
 public final class TopKnnCollector extends AbstractKnnCollector {
 
+  // greediness of globally non-competitive search: [0,1]
+  private static final float DEFAULT_GREEDINESS = 0.9f;
   private final NeighborQueue queue;
+  private final float greediness;
+  private final FloatHeap nonCompetitiveQueue;
+  private final FloatHeap updatesQueue;
+  private final int interval = 0x3ff; // 1023

Review Comment:
   This seems quite large to me, based on the total node visited counts in the 
Cohere results. For example, for fo = 90 we'd only refresh around twice.



##########
lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:
##########
@@ -27,25 +29,72 @@
  */
 public final class TopKnnCollector extends AbstractKnnCollector {
 
+  // greediness of globally non-competitive search: [0,1]
+  private static final float DEFAULT_GREEDINESS = 0.9f;
   private final NeighborQueue queue;
+  private final float greediness;
+  private final FloatHeap nonCompetitiveQueue;
+  private final FloatHeap updatesQueue;
+  private final int interval = 0x3ff; // 1023
+  private final BlockingFloatHeap globalSimilarityQueue;
+  private boolean kResultsCollected = false;
+  private float cachedGlobalMinSim = Float.NEGATIVE_INFINITY;
 
   /**
    * @param k the number of neighbors to collect
    * @param visitLimit how many vector nodes the results are allowed to visit
    */
-  public TopKnnCollector(int k, int visitLimit) {
+  public TopKnnCollector(int k, int visitLimit, BlockingFloatHeap 
globalSimilarityQueue) {
     super(k, visitLimit);
+    this.greediness = DEFAULT_GREEDINESS;
     this.queue = new NeighborQueue(k, false);
+    this.globalSimilarityQueue = globalSimilarityQueue;
+
+    if (globalSimilarityQueue == null) {
+      this.nonCompetitiveQueue = null;
+      this.updatesQueue = null;
+    } else {
+      this.nonCompetitiveQueue = new FloatHeap(Math.max(1, Math.round((1 - 
greediness) * k)));
+      this.updatesQueue = new FloatHeap(k);
+    }
   }
 
   @Override
   public boolean collect(int docId, float similarity) {
-    return queue.insertWithOverflow(docId, similarity);
+    boolean localSimUpdated = queue.insertWithOverflow(docId, similarity);
+    boolean firstKResultsCollected = (kResultsCollected == false && 
queue.size() == k());
+    if (firstKResultsCollected) {
+      kResultsCollected = true;
+    }
+
+    boolean globalSimUpdated = false;
+    if (globalSimilarityQueue != null) {
+      updatesQueue.offer(similarity);
+      globalSimUpdated = nonCompetitiveQueue.offer(similarity);
+
+      if (kResultsCollected) {
+        // as we've collected k results, we can start do periodic updates with 
the global queue
+        if (firstKResultsCollected || (visitedCount & interval) == 0) {
+          cachedGlobalMinSim = 
globalSimilarityQueue.offer(updatesQueue.getHeap());

Review Comment:
   There is a subtlety: note that `firstKResultsCollected` is only true the 
first time we visit k nodes, so this is saying only refresh every 1024 
iterations thereafter. The refresh schedule is `k, k + 1024, k + 2048, ...`. 
(The || threw me initially.)
   
   As per my comment above, 1024 seems infrequent to me. (Of course you may 
have tested smaller values and determined this to be a good choice.) If we 
think it is risky sharing information too early, I would be inclined to share 
on the schedule `max(k, c1), max(k, c1) + c2, max(k, c1) + 2 * c2, ...` with 
`c1 > c2` and decouple the concerns.
   
   Also, there maybe issues with *using* information too early, in which case 
`minCompetitiveSimilarity` would need a check that `visitedCount > max(k, c1)` 
before it starts modifying `minSim`.
   
   In any case, I can see results being rather sensitive to these choices so if 
you haven't done a parameter exploration it might be worth trying a few choices.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]

Reply via email to