Re: [PR] Add Bulk Scorer For ToParentBlockJoinQuery [lucene]
javanna commented on PR #13697: URL: https://github.com/apache/lucene/pull/13697#issuecomment-2352173213 There's a couple of recent test failures, in main as well as 9x, that may have to do with this change, judging from the area that it touches: ``` FAILED: org.apache.lucene.search.join.TestBlockJoinBulkScorer.testSetMinCompetitiveScoreWithScoreModeMax Error Message: java.lang.AssertionError: expected:<{16=5.0, 10=10.0, 2=6.0}> but was:<{2=6.0, 10=10.0}> Stack Trace: java.lang.AssertionError: expected:<{16=5.0, 10=10.0, 2=6.0}> but was:<{2=6.0, 10=10.0}> at __randomizedtesting.SeedInfo.seed([8531AA8A82E0A86C:D0DE9F82717ED75C]:0) at app/junit@4.13.1/org.junit.Assert.fail(Assert.java:89) at app/junit@4.13.1/org.junit.Assert.failNotEquals(Assert.java:835) at app/junit@4.13.1/org.junit.Assert.assertEquals(Assert.java:120) at app/junit@4.13.1/org.junit.Assert.assertEquals(Assert.java:146) at app//org.apache.lucene.search.join.TestBlockJoinBulkScorer.assertScores(TestBlockJoinBulkScorer.java:301) at app//org.apache.lucene.search.join.TestBlockJoinBulkScorer.testSetMinCompetitiveScoreWithScoreModeMax(TestBlockJoinBulkScorer.java:397) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add Bulk Scorer For ToParentBlockJoinQuery [lucene]
jpountz commented on PR #13697: URL: https://github.com/apache/lucene/pull/13697#issuecomment-2352178493 FYI @Mikep86 opened a PR at https://github.com/apache/lucene/pull/13785. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix Flaky Test In TestBlockJoinBulkScorer [lucene]
jpountz commented on code in PR #13785: URL: https://github.com/apache/lucene/pull/13785#discussion_r1760649790 ## lucene/join/src/test/org/apache/lucene/search/join/TestBlockJoinBulkScorer.java: ## @@ -283,7 +292,33 @@ public void collect(int doc) throws IOException { null, 0, NO_MORE_DOCS); -assertEquals(expectedScores, actualScores); + +if (expectedScoresList.size() == 1) { + assertEquals(expectedScoresList.getFirst(), actualScores); +} else { + assertEqualsToOneOf(expectedScoresList, actualScores); +} + } + + private static void assertEqualsToOneOf(List expectedList, Object actual) { +boolean foundMatch = false; +for (Object expected : expectedList) { + if (expected == null) { +if (actual == null) { + foundMatch = true; + break; +} + } else { +foundMatch = expected.equals(actual); +if (foundMatch) { + break; +} + } Review Comment: nit: maybe merge these two conditions under a single `if (Objects.equals(expected, actual))` which would take care of the `null` case? ## lucene/join/src/test/org/apache/lucene/search/join/TestBlockJoinBulkScorer.java: ## @@ -283,7 +292,33 @@ public void collect(int doc) throws IOException { null, 0, NO_MORE_DOCS); -assertEquals(expectedScores, actualScores); + +if (expectedScoresList.size() == 1) { + assertEquals(expectedScoresList.getFirst(), actualScores); +} else { + assertEqualsToOneOf(expectedScoresList, actualScores); +} Review Comment: I'm curious why you don't call `assertEqualsToOneOf` all the time, shouldn't it also work in the size==1 case? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Speed up advancing within a block. [lucene]
jpountz commented on PR #13692: URL: https://github.com/apache/lucene/pull/13692#issuecomment-2352315844 Woops, sorry I'm only seeing your reply now. The above analysis you referred to uses branchless binary search over the full buffer of 128 doc IDs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Speed up advancing within a block. [lucene]
jpountz commented on PR #13692: URL: https://github.com/apache/lucene/pull/13692#issuecomment-2352330831 I reverted this change. While it was a good win on average on my machine, it was almost a net loss on nightly benchmarks. So I'd rather keep the current linear scan approach, which has a lower overhead when creating new `PostingsEnum` objects and is simpler. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Deprecate BulkScorer#score(LeafReaderContext, Bits) [lucene]
javanna opened a new pull request, #13794: URL: https://github.com/apache/lucene/pull/13794 We have removed BulkScorer#score(LeafReaderContext, Bits) in main in favour of BulkScorer#score(LeafCollector collector, Bits acceptDocs, int min, int max) as part of #13542. This commit deprecates the method in 9.x. Internal usages of it are left untouched as there may be subclasses that override it, which we don't want to break in a minor release. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add support for intra-segment search concurrency [lucene]
javanna commented on code in PR #13542: URL: https://github.com/apache/lucene/pull/13542#discussion_r1760803091 ## lucene/core/src/java/org/apache/lucene/search/BulkScorer.java: ## @@ -27,18 +27,6 @@ */ public abstract class BulkScorer { - /** - * Scores and collects all matching documents. - * - * @param collector The collector to which all matching documents are passed. - * @param acceptDocs {@link Bits} that represents the allowed documents to match, or {@code null} - * if they are all allowed to match. - */ - public void score(LeafCollector collector, Bits acceptDocs) throws IOException { Review Comment: I opened this after all: https://github.com/apache/lucene/pull/13794 . It deprecates the method but leaves its usages untouched to avoid breaking users that override it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Deprecate BulkScorer#score(LeafReaderContext, Bits) [lucene]
jpountz commented on code in PR #13794: URL: https://github.com/apache/lucene/pull/13794#discussion_r1760821555 ## lucene/core/src/java/org/apache/lucene/search/BulkScorer.java: ## @@ -33,7 +33,11 @@ public abstract class BulkScorer { * @param collector The collector to which all matching documents are passed. * @param acceptDocs {@link Bits} that represents the allowed documents to match, or {@code null} * if they are all allowed to match. + * @deprecated in favour of {@link #score(LeafCollector, Bits, int, int)}. Callers should instead + * call the method variant that takes min and max as arguments. Subclasses that override it Review Comment: maybe also suggest passing min=0 and max=DocIdSetIterator.NO_MORE_DOCS here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Deprecate BulkScorer#score(LeafReaderContext, Bits) [lucene]
javanna commented on code in PR #13794: URL: https://github.com/apache/lucene/pull/13794#discussion_r1760852834 ## lucene/core/src/java/org/apache/lucene/search/BulkScorer.java: ## @@ -33,7 +33,11 @@ public abstract class BulkScorer { * @param collector The collector to which all matching documents are passed. * @param acceptDocs {@link Bits} that represents the allowed documents to match, or {@code null} * if they are all allowed to match. + * @deprecated in favour of {@link #score(LeafCollector, Bits, int, int)}. Callers should instead + * call the method variant that takes min and max as arguments. Subclasses that override it Review Comment: good idea -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Should we auto-adjust top score doc and top field collector manager based on slices? [lucene]
javanna commented on issue #13791: URL: https://github.com/apache/lucene/issues/13791#issuecomment-2352478375 ++ to that @jpountz that would also be my preferred approach. I somehow made the assumption that we are not willing to go that route, but I think we should revisit that decision, assuming the slowdown is small. It feels wrong for sure that users have to know about internal execution when creating a collector manager. Ideally, providing slices to adapt the behaviour would not be necessary either. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Deprecate BulkScorer#score(LeafReaderContext, Bits) [lucene]
javanna merged PR #13794: URL: https://github.com/apache/lucene/pull/13794 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Deprecate BulkScorer#score(LeafReaderContext, Bits) [lucene]
javanna commented on PR #13794: URL: https://github.com/apache/lucene/pull/13794#issuecomment-2352510100 Thanks @jpountz ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove CollectorManager#forSequentialExecution [lucene]
javanna commented on PR #13790: URL: https://github.com/apache/lucene/pull/13790#issuecomment-2352511247 @gsmiller how do you feel about this one? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add support for intra-segment search concurrency [lucene]
javanna commented on code in PR #13542: URL: https://github.com/apache/lucene/pull/13542#discussion_r1760911695 ## lucene/facet/src/java/org/apache/lucene/facet/FacetsCollector.java: ## @@ -97,12 +97,12 @@ public List getMatchingDocs() { public void collect(int doc) throws IOException { docsBuilder.grow(1).add(doc); if (keepScores) { - if (totalHits >= scores.length) { -float[] newScores = new float[ArrayUtil.oversize(totalHits + 1, 4)]; -System.arraycopy(scores, 0, newScores, 0, totalHits); + if (doc >= scores.length) { +float[] newScores = new float[ArrayUtil.oversize(doc + 1, 4)]; +System.arraycopy(scores, 0, newScores, 0, doc); scores = newScores; } - scores[totalHits] = scorer.score(); + scores[doc] = scorer.score(); Review Comment: thanks for looking! I noticed that we document the scores array as non-sparse and this change makes it indeed sparse. I think I could have been smarter here and keep it non-sparse. I will try to address this. It feels bad that we allocate such a big array and keep on resizing it as well. We should rather then build in advance knowing how many docs the leaf has. I will see what I can do about this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix Flaky Test In TestBlockJoinBulkScorer [lucene]
Mikep86 commented on code in PR #13785: URL: https://github.com/apache/lucene/pull/13785#discussion_r1761059326 ## lucene/join/src/test/org/apache/lucene/search/join/TestBlockJoinBulkScorer.java: ## @@ -283,7 +292,33 @@ public void collect(int doc) throws IOException { null, 0, NO_MORE_DOCS); -assertEquals(expectedScores, actualScores); + +if (expectedScoresList.size() == 1) { + assertEquals(expectedScoresList.getFirst(), actualScores); +} else { + assertEqualsToOneOf(expectedScoresList, actualScores); +} Review Comment: It should and I originally did that, but if you use `assertEquals` there are nicer IDE integrations for value comparison on failure. Given that this method is called with a single map of expected scores in nearly all cases, I tried to preserve that functionality. This could help in the case of a failure in `testScoreRandomIndices`, which could have a large random score map. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix Flaky Test In TestBlockJoinBulkScorer [lucene]
jpountz commented on code in PR #13785: URL: https://github.com/apache/lucene/pull/13785#discussion_r1761068214 ## lucene/join/src/test/org/apache/lucene/search/join/TestBlockJoinBulkScorer.java: ## @@ -283,7 +292,33 @@ public void collect(int doc) throws IOException { null, 0, NO_MORE_DOCS); -assertEquals(expectedScores, actualScores); + +if (expectedScoresList.size() == 1) { + assertEquals(expectedScoresList.getFirst(), actualScores); +} else { + assertEqualsToOneOf(expectedScoresList, actualScores); +} Review Comment: OK -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix Flaky Test In TestBlockJoinBulkScorer [lucene]
jpountz merged PR #13785: URL: https://github.com/apache/lucene/pull/13785 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Make Operations#optional create simpler automata. [lucene]
jpountz merged PR #13793: URL: https://github.com/apache/lucene/pull/13793 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] "cz" (vs ISO langauge code "cs") for Czech analysis package? [LUCENE-6366] [lucene]
WEBCON-BPS-DEV commented on issue #7426: URL: https://github.com/apache/lucene/issues/7426#issuecomment-2352942882 Is there any update on the issue? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] First-class random access API for KnnVectorValues [lucene]
msokolov commented on PR #13779: URL: https://github.com/apache/lucene/pull/13779#issuecomment-2353036120 I pushed a new revision here addressing some of the major comments: 1. `KnnVectorValues.iterator()` now generally provides a new iterator; no caching is done. I removed `createIterator()`. Main impact was on VectorScorer (and in tests) where we now create iterators and store them locally. This is much better; thanks for the feedback. 2. I added implementations for `advance()` and got rid of the default impl. 3. I *removed* impls of `cost()` and added a default impl that throws UOE. This method is only ever used during search() and most of these values sources will never be searched. The exceptions are those that can be used by the ValueSource API: basically the indexed values returned by a reader. We have lots and lots of other values impls that are used during indexing for which we don't need cost. I briefly considered separating these new iterators from DISI, but that ended up in some trouble. 4. re: `getVectorByteLength()` @ChrisHegarty is correct that this is needed as it is today. We could in theory make it final (or inline it whatever) if we added some more VectorEncodings to represent the compressed cases, but I'm inclined to leave it as is. This way we could in theory support a variable size encoding? And anyway it isn't clear we want to mix up the "encoding" with compression? I didn't have a chance to look seriously at removing `copy()` API. I don't think we ought to create a simple wrapper though since afaict it would require an additional memory copy of every vector value. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] First-class random access API for KnnVectorValues [lucene]
msokolov commented on PR #13779: URL: https://github.com/apache/lucene/pull/13779#issuecomment-2353098829 OK there seem to be some test failures ... I did a complete run, but randomized testing always seems to ferret out something interesting! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Extended spell checker with phrase support and adaptive user session analysis. [LUCENE-626] [lucene]
Menahali commented on issue #1701: URL: https://github.com/apache/lucene/issues/1701#issuecomment-2353104696 > Karl Wettin ([migrated from JIRA](https://issues.apache.org/jira/browse/LUCENE-626?focusedCommentId=12477688&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12477688)) > > Nicolas Lalevée [03/Mar/07 01:04 PM] > > This feature looks interesting, but why should it depend on #1628 ? > > It use the Index (notification, unison index factory methods, et c.) and IndexFacade (cache, fresh reader/searcher et c.) available in that patch. And by doing that, it also enables me to use InstantiatedIndex for the a priori corpus and ngram index to speed up the response time even more. > > I can't mack any thing in my program the baker closed every thing in my githup program -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[I] SpanOrQuery uses IDFs of failed subqueries in score calculation. [lucene]
tkarampAlpha opened a new issue, #13796: URL: https://github.com/apache/lucene/issues/13796 ### Description It seems that for SpanOrQuery IDF of terms belonging in subqueries that will not match a given document, will affect said document's score. I have observed this through on which I have 3 documents: ``` doc1: field: something doc2: field: nothing doc3: field: anything ``` And I issue the following query: ```spanOr([Contents:something, Contents:nothing])``` If you check at the score explanation you will notice that in both document's score the idf of both terms affects it even though for each document only one matches. This is an example of the explanation of the first document's score: ``` 3.9616547 = weight(spanOr([Contents:something, Contents:nothing]) in 0) [AsBM25Similarity], result of: 3.9616547 = score(freq=1.0), computed as boost * idf * tf from: 51.0 = boost 3.9616585 = idf, sum of: 1.9808292 = idf for term nothing , computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) + 1 from: 1 = docFreq 3 = docCount 1.9808292 = idf for term something , computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) + 1 from: 1 = docFreq 3 = docCount 0.019607842 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from: 1.0 = phraseFreq=1.0 50.0 = k1, term saturation parameter 0.0 = b, length normalization parameter 1.0 = dl, length of field 2.0 = avgdl, average length of field ``` ### Version and environment details lucene 9.7.0 through solr 9.3.0 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] First-class random access API for KnnVectorValues [lucene]
msokolov commented on PR #13779: URL: https://github.com/apache/lucene/pull/13779#issuecomment-2353215032 Regarding the rename of `fromOrdToDoc` to `all` I think it was not helpful and plan to revert or maybe come up with some other name. The problem is we also have `createDenseIterator` which is also `all`. Essentially we have Sparse and Dense all-iterators. Maybe instead of fromOrdToDoc we can say `createSparseIterator`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] First-class random access API for KnnVectorValues [lucene]
jpountz commented on PR #13779: URL: https://github.com/apache/lucene/pull/13779#issuecomment-2353373827 FWIW I started playing with removing copy() by replacing it with a factory method for a dictionary: https://github.com/msokolov/lucene/commit/ae7aca32a690a4b21a3da793258ce17560b551e7. Not sure how far I'll go. :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] First-class random access API for KnnVectorValues [lucene]
msokolov commented on PR #13779: URL: https://github.com/apache/lucene/pull/13779#issuecomment-2353397543 I also just started trying to replace `copy()` with the approach of adding `vectorValue(int ord, float[] outValue)` although this does add a copy operation in some cases where previously we would expose internal storage so I'm not sure it's great -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Speed up advancing within a block. [lucene]
mikemccand commented on PR #13692: URL: https://github.com/apache/lucene/pull/13692#issuecomment-2353416849 > I plotted the number of docs that queries need to skip within a block when advancing This is a really cool chart! Maybe we could somehow dynamically optimize, picking the right search (linear, vs (branchless) binary) depending on the stats/behavior of the actual blocks we've encountered so far? It'd be a simple form of query optimization ... but that's maybe more code complexity than is worth it ... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix Flaky Test In TestBlockJoinBulkScorer [lucene]
javanna commented on PR #13785: URL: https://github.com/apache/lucene/pull/13785#issuecomment-2353671468 Thanks for fixing this! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove CollectorManager#forSequentialExecution [lucene]
gsmiller commented on PR #13790: URL: https://github.com/apache/lucene/pull/13790#issuecomment-2354128424 tl;dr: I agree with removing this. I was [initially hesitant](https://github.com/apache/lucene/pull/13735#issuecomment-2338340094) to add this for a lot of the same reasons, but was convinced when realizing we could at least explicitly fail by asserting that more than one collector is never created (so at least we wouldn't be silently creating tear-your-hair-out concurrency bugs). But looking at the bigger picture, the whole point of removing `IndexSearcher#search(Query, Collector)` is to help users avoid traps situations where they think they're setup for concurrent search but not utilizing it. I agree it's a worse trap in a lot of ways to introduce this failure mode. I particularly do not like that this "helper" collector manager can't _actually_ check if it's being used with a concurrency-ready IndexSearcher. So I can imagine a real failure case where IndexSearchers are being created with an Executor but only have a single segment and things appear to work—but then break later (e.g., users that are force-mergi ng their segments, or users that have sparse test coverage that only includes testing over single segments, etc.). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]
vigyasharma commented on issue #13387: URL: https://github.com/apache/lucene/issues/13387#issuecomment-2354135495 I wonder if we can leverage IndexWriter's `addIndexes(Directory... dirs)` [API](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java#L2984) for this. We could create separate indexes for every category (log groups 2xx, 4xx, 5xx in the example here), and combine them into one using this API. Internally, this version of the API simply copies over all segment files in the directory, so it should be pretty fast. This could mean that each shard for an OpenSearch/Elasticsearch index would maintain internal indexes for each desired category, and use the API to combine them into a common "shard" index at every flush? We'd still need a way to maintain category labels for a segment during merging, but that's a common problem for any approach we take. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] First-class random access API for KnnVectorValues [lucene]
navneet1v commented on code in PR #13779: URL: https://github.com/apache/lucene/pull/13779#discussion_r1762496963 ## lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java: ## @@ -0,0 +1,281 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.index; + +import java.io.IOException; +import org.apache.lucene.codecs.lucene90.IndexedDISI; +import org.apache.lucene.document.KnnByteVectorField; +import org.apache.lucene.document.KnnFloatVectorField; +import org.apache.lucene.search.DocIdSetIterator; +import org.apache.lucene.util.Bits; + +/** + * This class abstracts addressing of document vector values indexed as {@link KnnFloatVectorField} + * or {@link KnnByteVectorField}. + * + * @lucene.experimental + */ +public abstract class KnnVectorValues { + + /** The iterator associated with these values. */ + protected DocIndexIterator iterator; + + /** Return the dimension of the vectors */ + public abstract int dimension(); + + /** + * Return the number of vectors for this field. + * + * @return the number of vectors returned by this iterator + */ + public abstract int size(); + + /** + * Return the docid of the document indexed with the given vector ordinal. This default + * implementation returns the argument and is appropriate for dense values implementations where + * every doc has a single value. + */ + public int ordToDoc(int ord) { +return ord; + } + + /** + * Creates a new copy of this {@link KnnVectorValues}. This is helpful when you need to access + * different values at once, to avoid overwriting the underlying vector returned. + */ + public abstract KnnVectorValues copy() throws IOException; + + /** Returns the vector byte length, defaults to dimension multiplied by float byte size */ + public int getVectorByteLength() { +return dimension() * getEncoding().byteSize; + } + + /** The vector encoding of these values. */ + public abstract VectorEncoding getEncoding(); + + /** Returns a Bits accepting docs accepted by the argument and having a vector value */ + public Bits getAcceptOrds(Bits acceptDocs) { +// FIXME: change default to return acceptDocs and provide this impl +// somewhere more specialized (in every non-dense impl). +if (acceptDocs == null) { + return null; +} +return new Bits() { + @Override + public boolean get(int index) { +return acceptDocs.get(ordToDoc(index)); + } + + @Override + public int length() { +return size(); + } +}; + } + + /** + * Return the iterator for this instance. If you need multiple iterators, call + * this.copy().iterator(). + */ + public DocIndexIterator iterator() { +if (iterator == null) { + iterator = createIterator(); +} +return iterator; + } + + /** + * Create an iterator for this instance; typically called once by iterator(). Wrapper + * value classes delegate to their inner instance's iterator and shouldn't implement this. + */ + protected DocIndexIterator createIterator() { +throw new UnsupportedOperationException(); + } + + /** + * A DocIdSetIterator that also provides an index() method tracking a distinct ordinal for a + * vector associated with each doc. + */ + public abstract static class DocIndexIterator extends DocIdSetIterator { + +/** return the value index (aka "ordinal" or "ord") corresponding to the current doc */ +public abstract int index(); + +@Override +public int advance(int target) throws IOException { + return slowAdvance(target); +} + +@Override +public long cost() { + throw new UnsupportedOperationException(); Review Comment: +1. Even in FloatVecotorValues cost() is returning size() only. https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/FloatVectorValues.java#L46-L48 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.