vigyasharma commented on code in PR #12794: URL: https://github.com/apache/lucene/pull/12794#discussion_r1397994430
########## lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java: ########## @@ -26,26 +26,71 @@ * @lucene.experimental */ public final class TopKnnCollector extends AbstractKnnCollector { + private static final float DEFAULT_GREEDINESS = 0.9f; private final NeighborQueue queue; + private final float greediness; + private final NeighborQueue queueg; + private final MaxScoreAccumulator globalMinSimAcc; + private boolean kResultsCollected = false; + private float cachedGlobalMinSim = Float.NEGATIVE_INFINITY; + + // greediness of globally non-competitive search: [0,1] /** * @param k the number of neighbors to collect * @param visitLimit how many vector nodes the results are allowed to visit + * @param globalMinSimAcc the global minimum competitive similarity tracked across all segments */ - public TopKnnCollector(int k, int visitLimit) { + public TopKnnCollector(int k, int visitLimit, MaxScoreAccumulator globalMinSimAcc) { + super(k, visitLimit); + this.greediness = DEFAULT_GREEDINESS; + this.queue = new NeighborQueue(k, false); + int queuegSize = Math.max(1, Math.round((1 - greediness) * k)); + this.queueg = new NeighborQueue(queuegSize, false); + this.globalMinSimAcc = globalMinSimAcc; + } + + public TopKnnCollector( + int k, int visitLimit, MaxScoreAccumulator globalMinSimAcc, float greediness) { super(k, visitLimit); + this.greediness = greediness; this.queue = new NeighborQueue(k, false); + this.queueg = new NeighborQueue(Math.round((1 - greediness) * k), false); + this.globalMinSimAcc = globalMinSimAcc; } @Override public boolean collect(int docId, float similarity) { - return queue.insertWithOverflow(docId, similarity); + boolean result = queue.insertWithOverflow(docId, similarity); + queueg.insertWithOverflow(docId, similarity); + + boolean reachedKResults = (kResultsCollected == false && queue.size() == k()); + if (reachedKResults) { + kResultsCollected = true; + } + if (globalMinSimAcc != null && kResultsCollected) { + // as we've collected k results, we can start exchanging globally + globalMinSimAcc.accumulate(queue.topNode(), queue.topScore()); + + // periodically update the local copy of global similarity + if (reachedKResults || (visitedCount & globalMinSimAcc.modInterval) == 0) { + MaxScoreAccumulator.DocAndScore docAndScore = globalMinSimAcc.get(); + cachedGlobalMinSim = docAndScore.score; Review Comment: We only peek at the global minimum similarity after we have accumulated `k` local results. For my understanding, what would happen if we were to start using `globalMinSim` as soon as any leaf has collected its top K results? I'm curious how the graph structure and entry point calculation affects this. Looking at `graphSearcher.findBestEntryPoint(...)`, i think we navigate to the closest node we can find to our query on layer-1 and use that as the starting point for layer-0. Now layer-0 is a lot more dense than layer-1, and our best candidates are likely in the space between the entry-point on layer-1 and all its neighbors. Is it always a good idea to at least look at k nodes around the entry point, regardless of the global min. competitive score? ########## lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java: ########## @@ -26,26 +26,71 @@ * @lucene.experimental */ public final class TopKnnCollector extends AbstractKnnCollector { + private static final float DEFAULT_GREEDINESS = 0.9f; Review Comment: So `greediness` is essentially how "greedy" our algorithm for picking top matches gets to be. At 1, we go with the global min competitive similarity. And by default, we continue to search the segment as long as we find neighbors that are better than the top 10% of our local picks from the segment `(1-0.9)*k)`. Should we add some documentation around this param? My initial, naive impression was that higher greediness might mean we pick more nodes per segment, which is quite the opposite. :) ########## lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java: ########## @@ -26,26 +26,71 @@ * @lucene.experimental */ public final class TopKnnCollector extends AbstractKnnCollector { + private static final float DEFAULT_GREEDINESS = 0.9f; private final NeighborQueue queue; + private final float greediness; + private final NeighborQueue queueg; + private final MaxScoreAccumulator globalMinSimAcc; + private boolean kResultsCollected = false; + private float cachedGlobalMinSim = Float.NEGATIVE_INFINITY; + + // greediness of globally non-competitive search: [0,1] /** * @param k the number of neighbors to collect * @param visitLimit how many vector nodes the results are allowed to visit + * @param globalMinSimAcc the global minimum competitive similarity tracked across all segments */ - public TopKnnCollector(int k, int visitLimit) { + public TopKnnCollector(int k, int visitLimit, MaxScoreAccumulator globalMinSimAcc) { + super(k, visitLimit); + this.greediness = DEFAULT_GREEDINESS; + this.queue = new NeighborQueue(k, false); + int queuegSize = Math.max(1, Math.round((1 - greediness) * k)); + this.queueg = new NeighborQueue(queuegSize, false); + this.globalMinSimAcc = globalMinSimAcc; + } + + public TopKnnCollector( + int k, int visitLimit, MaxScoreAccumulator globalMinSimAcc, float greediness) { super(k, visitLimit); + this.greediness = greediness; this.queue = new NeighborQueue(k, false); + this.queueg = new NeighborQueue(Math.round((1 - greediness) * k), false); + this.globalMinSimAcc = globalMinSimAcc; } @Override public boolean collect(int docId, float similarity) { - return queue.insertWithOverflow(docId, similarity); + boolean result = queue.insertWithOverflow(docId, similarity); + queueg.insertWithOverflow(docId, similarity); + + boolean reachedKResults = (kResultsCollected == false && queue.size() == k()); Review Comment: So `reachedKResults` is supposed to be `true` only for the first time we get to k results, while `kResultsCollected` continues to be true even as we collect subsequent hits? Minor: Should we try to make it more explicit in the variable name? Maybe something like `firstKResultsCollected`? ########## lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java: ########## @@ -26,26 +26,71 @@ * @lucene.experimental */ public final class TopKnnCollector extends AbstractKnnCollector { + private static final float DEFAULT_GREEDINESS = 0.9f; private final NeighborQueue queue; + private final float greediness; + private final NeighborQueue queueg; + private final MaxScoreAccumulator globalMinSimAcc; + private boolean kResultsCollected = false; + private float cachedGlobalMinSim = Float.NEGATIVE_INFINITY; + + // greediness of globally non-competitive search: [0,1] /** * @param k the number of neighbors to collect * @param visitLimit how many vector nodes the results are allowed to visit + * @param globalMinSimAcc the global minimum competitive similarity tracked across all segments */ - public TopKnnCollector(int k, int visitLimit) { + public TopKnnCollector(int k, int visitLimit, MaxScoreAccumulator globalMinSimAcc) { + super(k, visitLimit); + this.greediness = DEFAULT_GREEDINESS; + this.queue = new NeighborQueue(k, false); + int queuegSize = Math.max(1, Math.round((1 - greediness) * k)); + this.queueg = new NeighborQueue(queuegSize, false); Review Comment: This queue is still local to each leaf, and hence to each task in the executor, right? I ask because I see some references to a _global queue_ for greediness, in the issue description. ########## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java: ########## @@ -79,24 +81,30 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException { filterWeight = null; } + MaxScoreAccumulator globalMinSimAcc = new MaxScoreAccumulator(); TaskExecutor taskExecutor = indexSearcher.getTaskExecutor(); List<LeafReaderContext> leafReaderContexts = reader.leaves(); List<Callable<TopDocs>> tasks = new ArrayList<>(leafReaderContexts.size()); for (LeafReaderContext context : leafReaderContexts) { - tasks.add(() -> searchLeaf(context, filterWeight)); + tasks.add(() -> searchLeaf(context, filterWeight, globalMinSimAcc)); Review Comment: Is the `MaxScoreAccumulator` thread safe? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org