Re: [PR] TaskExecutor should not fork unnecessarily [lucene]
jpountz commented on PR #13472: URL: https://github.com/apache/lucene/pull/13472#issuecomment-2155882665 Your suggestion is how it used to work before: https://github.com/apache/lucene/pull/12499/files#diff-e744fc99cb74627f02e30c1cbda56dede66d2ecdfd57db2ce869b9a9a43fa41cR49-R64. The context switching isn't great indeed, but executing some tasks on the current thread makes it hard to correctly reason about and configure the number of threads that should perform work. The idea behind this other PR was that you would have a worker executor that would do almost all the work, and a coordination executor that would mostly coordinate work in the worker threadpool. I'm not sure if we're at a point where this coordination executor could run on virtual threads, but at least conceptually this is how I'm thinking of it. Something tangential that we have touched on a couple of times but haven't implemented for now would consist of introducing an API on `IndexSearcher` that doesn't require waiting on futures to complete, e.g. something like `public void search(Query query, CollectorManager collectorManager, IOConsumer resultConsumer)`. Also related, we are replacing some of the forking with the new support we introduced for I/O concurrency: https://github.com/apache/lucene/pull/13359/files#diff-ad7c504406afec8940592f1fda0062d3e5321cdc5693c24ec6f5cfb02f8dd90dL100-R114. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Reciprocal Rank Fusion (RRF) in TopDocs [lucene]
jpountz commented on code in PR #13470: URL: https://github.com/apache/lucene/pull/13470#discussion_r1631987192

## lucene/core/src/java/org/apache/lucene/search/TopDocs.java: @@ -350,4 +354,38 @@ private static TopDocs mergeAux( ... return new TopFieldDocs(totalHits, hits, sort.getSort()); } }

The added method under review:

```java
  /** Reciprocal Rank Fusion method. */
  public static TopDocs rrf(int TopN, int k, TopDocs[] hits) {
    Map<Integer, Float> rrfScore = new HashMap<>();
    long minHits = Long.MAX_VALUE;
    for (TopDocs topDoc : hits) {
      minHits = Math.min(minHits, topDoc.totalHits.value);
      Map<Integer, Float> scoreMap = new HashMap<>();
      for (ScoreDoc scoreDoc : topDoc.scoreDocs) {
        scoreMap.put(scoreDoc.doc, scoreDoc.score);
      }

      List<Map.Entry<Integer, Float>> scoreList = new ArrayList<>(scoreMap.entrySet());
      scoreList.sort(Map.Entry.comparingByValue());

      int rank = 1;
      for (ScoreDoc scoreDoc : topDoc.scoreDocs) {
        rrfScore.put(scoreDoc.doc, rrfScore.getOrDefault(scoreDoc.doc, 0.0f) + 1.0f / (rank + k));
        rank++;
      }
    }

    List<Map.Entry<Integer, Float>> rrfScoreRank = new ArrayList<>(rrfScore.entrySet());
    rrfScoreRank.sort(
        Map.Entry.<Integer, Float>comparingByValue().reversed()); // Sort in descending order

    ScoreDoc[] rrfScoreDocs = new ScoreDoc[Math.min(TopN, rrfScoreRank.size())];
    for (int i = 0; i < rrfScoreDocs.length; i++) {
      rrfScoreDocs[i] = new ScoreDoc(rrfScoreRank.get(i).getKey(), rrfScoreRank.get(i).getValue());
    }
```

Review Comment (on `scoreList.sort(Map.Entry.comparingByValue());`): We don't seem to be using these `scoreMap` and `scoreList` anywhere?

Review Comment (on `rrfScoreDocs[i] = new ScoreDoc(...)`): Nit: we should preserve the original `shardIndex` that is configured on the `ScoreDoc` object, which identifies the shard that it is coming from.

Review Comment (on `/** Reciprocal Rank Fusion method. */`): Users could use more details in these javadocs, e.g. what are `k` and `topN`?

Review Comment (on `minHits = Math.min(minHits, topDoc.totalHits.value);`): I wonder if it should be a `max` rather than a `min`? Presumably, hits from any of the top hits are considered hits globally, and we are just using RRF to boost hits that are ranked high in multiple top hits?

Review Comment (on `Map<Integer, Float> rrfScore = new HashMap<>();`): Note that we should identify documents not only by their doc ID, but by the combination of `ScoreDoc.shardIndex` and `ScoreDoc.doc`, as there could be multiple documents that have the same doc ID but come from different shards.

Review Comment (on the method signature): nit: function arguments should use lower camel case
```suggestion
  public static TopDocs rrf(int topN, int k, TopDocs[] hits) {
```
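The fusion being reviewed can be illustrated standalone. The sketch below is not the PR's code: it folds in the reviewer's suggestion of keying contributions on a (shardIndex, doc) pair rather than the doc ID alone, and the `ShardDoc` record, `rrf` signature, and plain `List` inputs (instead of Lucene's `TopDocs`/`ScoreDoc`) are our own illustrative choices:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RrfSketch {
  // Identify a hit by (shardIndex, doc): doc IDs alone can collide across shards.
  public record ShardDoc(int shardIndex, int doc) {}

  // rankedLists holds one ranked hit list per query; k is the usual RRF constant (>= 1).
  // Each appearance of a hit contributes 1 / (rank + k) to its fused score.
  public static List<Map.Entry<ShardDoc, Float>> rrf(
      int topN, int k, List<List<ShardDoc>> rankedLists) {
    Map<ShardDoc, Float> scores = new HashMap<>();
    for (List<ShardDoc> hits : rankedLists) {
      int rank = 1;
      for (ShardDoc hit : hits) {
        scores.merge(hit, 1.0f / (rank + k), Float::sum);
        rank++;
      }
    }
    // Sort fused scores in descending order and keep the top N.
    List<Map.Entry<ShardDoc, Float>> merged = new ArrayList<>(scores.entrySet());
    merged.sort(Map.Entry.<ShardDoc, Float>comparingByValue().reversed());
    return merged.subList(0, Math.min(topN, merged.size()));
  }
}
```

With k = 1 and the two lists [(0,1), (0,2)] and [(0,2), (1,1)], document (0,2) wins with 1/3 + 1/2 ≈ 0.83: agreement across lists outweighs a single first-place rank, which is the point of RRF.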
Re: [PR] TaskExecutor should not fork unnecessarily [lucene]
original-brownbear commented on PR #13472: URL: https://github.com/apache/lucene/pull/13472#issuecomment-2155894536 > The idea behind this other PR was that you would have a worker executor that would do almost all the work, and a coordination executor that would be mostly coordinating work in the worker threadpool. But that's not what is happening in practice right now? Whether I run N-1 tasks on the worker pool and one on the caller, or N tasks on the worker and sleep on the caller thread, either way the coordinator thread is blocked and not coordinating anything in the meantime? I can see the argument for using the worker pool to limit task concurrency to some number of threads/cores. But since we are blocking the coordinating executor's threads and leaving them idle, I wonder whether there is that much coordinator work to do in the first place, and whether the coordinator and worker pool can practically just be the same threads? If any of the tasks enqueued from the coordinating executor fans out again (and they do), that's how it works anyway. > Your suggestion is how it used to work before: Not quite, I think: the difference is that in my approach the coordinator thread keeps pulling tasks off of the queue in a loop, instead of just doing the last task. This means that the coordinating thread will not be "wasted" as much if the worker pool takes time to execute the tasks and can't do them all in parallel. Also, it resolves the deadlock issue for single-threaded or otherwise saturated executors.
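The "caller keeps pulling tasks instead of parking on futures" approach described above can be sketched with `FutureTask`, whose `run()` is a no-op once another thread has claimed the task. This is a rough standalone illustration, not the PR's actual code; `runAll` and the class name are made up for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;
import java.util.concurrent.Executor;

public class CallerDrainsQueue {
  // Submit every task to the workers, then have the calling thread run through
  // the same tasks itself: FutureTask.run() returns immediately for any task a
  // worker already claimed, so the caller only executes leftover work instead
  // of sleeping. This also avoids deadlock on a saturated or 1-thread executor.
  static <T> List<T> runAll(List<Callable<T>> tasks, Executor workers) throws Exception {
    List<FutureTask<T>> futures = new ArrayList<>();
    for (Callable<T> task : tasks) {
      FutureTask<T> ft = new FutureTask<>(task);
      futures.add(ft);
      workers.execute(ft);
    }
    for (FutureTask<T> ft : futures) {
      ft.run(); // no-op if a worker got there first
    }
    List<T> results = new ArrayList<>();
    for (FutureTask<T> ft : futures) {
      results.add(ft.get()); // blocks only on tasks a worker is mid-flight on
    }
    return results;
  }

  public static void main(String[] args) {
    try {
      List<Callable<Integer>> tasks = new ArrayList<>();
      for (int i = 0; i < 4; i++) {
        final int v = i;
        tasks.add(() -> v * v);
      }
      ExecutorService workers = Executors.newFixedThreadPool(2);
      List<Integer> results = runAll(tasks, workers);
      workers.shutdown();
      if (!results.equals(List.of(0, 1, 4, 9))) throw new AssertionError(results);
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}
```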
Re: [PR] TaskExecutor should not fork unnecessarily [lucene]
original-brownbear commented on PR #13472: URL: https://github.com/apache/lucene/pull/13472#issuecomment-2155919543 > I'm not sure if we're at a point where this coordination executor could run on virtual threads, but at least conceptually this is how I'm thinking of it. That does make sense. But I wonder if this is something we should leave to the runtime/OS/... to figure out for us. It seems very desirable to limit the number of context switches when the API is synchronous, so we can go through short tasks served from the page cache with as little overhead as possible? > would consist of introducing an API on IndexSearcher that doesn't require waiting on futures to complete ++ that'd be really nice and save a lot of overhead here. That said, we could optimize this code fairly easily to move from waiting on "futures" to waiting on a single future :) I haven't benchmarked this much, but if we see non-trivial overhead for the wait loop due to frequent wakeup-sleep cycles as we go through all of the futures, we could just have a ref-count around a single future, couldn't we?  > we are replacing some of the forking with the new support we introduced for I/O concurrency ❤️
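The "ref-count around a single future" idea could look roughly like this: each task decrements a shared counter on completion, and only the last one completes a single future, so the waiter takes one wakeup instead of N. The class and method names here are illustrative, not Lucene API:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

public class RefCountedDone {
  private final AtomicInteger pending;
  private final CompletableFuture<Void> done = new CompletableFuture<>();

  public RefCountedDone(int taskCount) {
    this.pending = new AtomicInteger(taskCount);
  }

  // Called by each task when it finishes; the last one releases the waiter.
  public void taskFinished() {
    if (pending.decrementAndGet() == 0) {
      done.complete(null); // single wakeup instead of one per future
    }
  }

  // The coordinating thread parks on one future instead of looping over N.
  public void awaitAll() {
    done.join();
  }
}
```

The atomic decrement chain also gives the waiter visibility of everything the tasks wrote before calling `taskFinished()`, so no extra synchronization is needed for results.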
Re: [PR] TaskExecutor should not fork unnecessarily [lucene]
jpountz commented on PR #13472: URL: https://github.com/apache/lucene/pull/13472#issuecomment-2155964757 Thanks for explaining, I had not read your implementation carefully. I agree that we are doing less blocking than in the previous implementation of this, though we could still be blocking at times, it seems? E.g. if you have two tasks and the first one takes more time? I understand what you are saying about reducing the overhead of forking, but I'm not too happy that it makes sizing thread pools more complicated in exchange? I need to think more about the trade-off.
Re: [PR] TaskExecutor should not fork unnecessarily [lucene]
original-brownbear commented on PR #13472: URL: https://github.com/apache/lucene/pull/13472#issuecomment-2155979514 > E.g. if you have two tasks and the first one takes more time? Right, that's a possible scenario, but unless we move to some kind of async API like the one you mentioned above, we'll always have blocking on the calling thread if there's a comparatively long-running task on one of the forked threads. > but I'm not too happy that it makes sizing thread pools more complicated in exchange? I need to think more about the trade-off. To me it seems like the opposite is true: this change makes reasoning about the sizing much easier. I find it very complicated to work out the relative sizes of the worker pool and the coordinator pool. I effectively want the worker pool sized just right so that I get the CPU utilisation I desire without oversubscribing and adding scheduling overhead. Now I have to add a coordinator pool that enables that into the mix. That one I have to size in such a way that I always have another thread available as long as my worker pool isn't fully utilised. That's quite hard to get right? With this change, I only have to size one pool, and since blocking is rare I can probably ignore it in the calculation. So the size of the pool comes out to ~ `desired_busy_cores / (1 - io_share_of_execution_time)`, doesn't it?
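Plugging numbers into that back-of-the-envelope formula (the helper method and the `ceil` rounding are our own illustration, not anything in the PR): with 8 cores to keep busy and 25% of execution time spent blocked on I/O, the single pool comes out to ceil(8 / 0.75) = 11 threads.

```java
public class PoolSizing {
  // poolSize ≈ desiredBusyCores / (1 - ioShareOfExecutionTime), rounded up so
  // there is always a runnable thread while others are parked on I/O.
  static int poolSize(int desiredBusyCores, double ioShare) {
    return (int) Math.ceil(desiredBusyCores / (1.0 - ioShare));
  }
}
```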
Re: [PR] WIP expose FlatVectorsFormat [lucene]
msokolov commented on code in PR #13469: URL: https://github.com/apache/lucene/pull/13469#discussion_r1632032521 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatVectorsReader.java: ## @@ -217,6 +220,18 @@ public ByteVectorValues getByteVectorValues(String field) throws IOException { vectorData); } + @Override + public void search(String field, float[] target, KnnCollector knnCollector, Bits acceptDocs) throws IOException { Review Comment: Yes, of course, OK. So I think this will work like this: if I leave the search() methods throwing UnsupportedOperationException, then use PerFieldKnnVectorValues to tie a field to a Lucene99ScalarQuantizedVectorsFormat, at which point I can call getFloatVectorValues and use the returned scorer's score() method. Or perhaps call getQuantizedVectorValues directly and use *its* score method. I'll give it a try. I'm OK with all that. If this becomes popular we might want to find a way that doesn't require using codec-specific classes, but we can let that sit for now, I think. Thank you!
Re: [PR] Reciprocal Rank Fusion (RRF) in TopDocs [lucene]
hack4chang commented on code in PR #13470: URL: https://github.com/apache/lucene/pull/13470#discussion_r1632101139 ## lucene/core/src/java/org/apache/lucene/search/TopDocs.java: ## @@ -350,4 +354,38 @@ private static TopDocs mergeAux( return new TopFieldDocs(totalHits, hits, sort.getSort()); } } + + /** Reciprocal Rank Fusion method. */ Review Comment: Do we have to check if the k value is greater than or equal to 1 in the code? And maybe mention it in the javadocs?
Re: [PR] Reciprocal Rank Fusion (RRF) in TopDocs [lucene]
hack4chang commented on code in PR #13470: URL: https://github.com/apache/lucene/pull/13470#discussion_r1632101267 ## lucene/core/src/java/org/apache/lucene/search/TopDocs.java: ## @@ -350,4 +354,38 @@ private static TopDocs mergeAux( return new TopFieldDocs(totalHits, hits, sort.getSort()); } } + + /** Reciprocal Rank Fusion method. */ + public static TopDocs rrf(int TopN, int k, TopDocs[] hits) { Review Comment: Oops, my bad.
Re: [PR] Reciprocal Rank Fusion (RRF) in TopDocs [lucene]
hack4chang commented on code in PR #13470: URL: https://github.com/apache/lucene/pull/13470#discussion_r1632104552 ## lucene/core/src/java/org/apache/lucene/search/TopDocs.java: ## @@ -350,4 +354,38 @@ ... + minHits = Math.min(minHits, topDoc.totalHits.value); Review Comment: The totalHits value was a tricky part; we didn't know what value to assign to it. IIUC, totalHits means all the matched documents for a query, and we can't really compute the union of the totalHits across all the TopDocs. So for now I just assign the min of the totalHits, to keep the totalHits relation "greater than or equal to" correct. And I want to ask for your opinion on this.
Re: [PR] Silence odd test runner warnings after gradle upgrade [lucene]
uschindler commented on PR #13471: URL: https://github.com/apache/lucene/pull/13471#issuecomment-2156173831 Will check tomorrow.
Re: [PR] Silence odd test runner warnings after gradle upgrade [lucene]
dweiss commented on PR #13471: URL: https://github.com/apache/lucene/pull/13471#issuecomment-2156174240 No worries, thank you.
Re: [PR] Reciprocal Rank Fusion (RRF) in TopDocs [lucene]
hack4chang commented on code in PR #13470: URL: https://github.com/apache/lucene/pull/13470#discussion_r1632111769 ## lucene/core/src/java/org/apache/lucene/search/TopDocs.java: ## @@ -350,4 +354,38 @@ ... + rrfScoreDocs[i] = new ScoreDoc(rrfScoreRank.get(i).getKey(), rrfScoreRank.get(i).getValue()); Review Comment: So this was also a tricky part for us. To my understanding, RRF combines search results based on a document's different ranks across the individual result lists. We are supposed to combine the ranks for each individual document, but should a document that comes from different shards be treated as different documents?
Re: [PR] Move synonym map off-heap for SynonymGraphFilter [lucene]
github-actions[bot] commented on PR #13054: URL: https://github.com/apache/lucene/pull/13054#issuecomment-2156239861 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!