Re: [PR] Removed Scorer#getWeight [lucene]
iamsanjay commented on PR #13440: URL: https://github.com/apache/lucene/pull/13440#issuecomment-2141366659 1. There were three classes that are using weight in some way so I leave them to be for now. `FunctionQuery#AllScorer` https://github.com/apache/lucene/blob/750a7c4d3b3e174023404bf363861dae31413901/lucene/queries/src/java/org/apache/lucene/queries/function/FunctionQuery.java#L117 `TermAutomatonScorer` https://github.com/apache/lucene/blob/750a7c4d3b3e174023404bf363861dae31413901/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/TermAutomatonScorer.java#L68 `TestMinShouldMatch2#SlowMinShouldMatchScorer` https://github.com/apache/lucene/blob/750a7c4d3b3e174023404bf363861dae31413901/lucene/core/src/test/org/apache/lucene/search/TestMinShouldMatch2.java#L357 2. As I removed weight, few ctors -- `JustCompileScorer`, `SimpleScorer`, `TestScoreCachingWrappingScorer#SimpleScorer`, `Scorer` -- have been reduced to default ctor. Should I removed those as well? 3. As we removing Weight, some APIs that were calling with weight is not required. Hence I removed the weight from their method signature as well -- Check: FunctionValues#getRangeScorer and all the respective overrides. 4. `SpatialQuery` has also one method where we could possibly change the API to remove the weight parameter as it is not required now. Should I change that as well? There is a possibility we may found some more APIs where the weight can be removed now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] WIP - Add minimum number of segments to TieredMergePolicy [lucene]
carlosdelest commented on code in PR #13430: URL: https://github.com/apache/lucene/pull/13430#discussion_r1620451379 ## lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java: ## @@ -93,6 +93,7 @@ public class TieredMergePolicy extends MergePolicy { private double segsPerTier = 10.0; private double forceMergeDeletesPctAllowed = 10.0; private double deletesPctAllowed = 20.0; + private int targetSearchConcurrency = -1; Review Comment: Yes - I iterated a bit on this but now it's equivalent. Thanks! ## lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java: ## @@ -522,21 +550,28 @@ private MergeSpecification doFindMerges( final List candidate = new ArrayList<>(); Review Comment: I see some test failures, failing to merge segments that are eligible both from a bytes and doc perspective. It seems related to [this line here](https://github.com/apache/lucene/pull/13430/files#diff-da7b621934abf1e7bc74d6bf15f17ddd95cffa4f374776125d8de5cdc0054b2dL507): ```java if (mergeType == MERGE_TYPE.NATURAL && sortedEligible.size() <= allowedSegCount && remainingDelCount <= allowedDelCount) { return spec; } ``` Here we're bailing out if we don't have enough eligible segments as the allowed seg count. When adding more allowed segment counts with the target search concurrency, this optimization makes it impossible to merge segments that are available for merging, and thus get over budget and fail the tests. I don't get this optimization - why shouldn't we try to merge if we have less sorted eligibles? What am I missing to change this so we can take into account the bigger number of `allowedSegCount` that comes with adding target search concurrency? ## lucene/core/src/test/org/apache/lucene/index/TestTieredMergePolicy.java: ## @@ -431,6 +447,128 @@ public void testForcedMergesRespectSegSize() throws Exception { dir.close(); } + public void testForcedMergesRespectsTargetSearchConcurrency() throws Exception { Review Comment: I've given it a try - LMKWYT! ## lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java: ## @@ -409,6 +433,8 @@ public MergeSpecification findMerges( // allowedSegCount may occasionally be less than segsPerTier // if segment sizes are below the floor size allowedSegCount = Math.max(allowedSegCount, segsPerTier); +// Override with the targetSearchConcurrency if it is greater +allowedSegCount = Math.max(allowedSegCount, targetSearchConcurrency); Review Comment: Correct, now the highest tier needs to take that into account. Thanks! 👍 ## lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java: ## @@ -257,6 +258,25 @@ public double getSegmentsPerTier() { return segsPerTier; } + /** + * Sets the target search concurrency. This allows merging to ensure that there are at least + * targetSearchConcurrency segments on the top tier. This setting can be overriden by force Review Comment: Makes sense, updating. ## lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java: ## @@ -581,11 +621,14 @@ private MergeSpecification doFindMerges( // whose length is less than the merge factor, it means we are reaching // the tail of the list of segments and will only find smaller merges. // Stop here. -if (bestScore != null && hitTooLarge == false && candidate.size() < mergeFactor) { +if (bestScore != null +&& hitTooLarge == false +&& hitMaxDocs == false +&& candidate.size() < mergeFactor) { break; } -final MergeScore score = score(candidate, hitTooLarge, segInfosSizes); +final MergeScore score = score(candidate, hitTooLarge || hitMaxDocs, segInfosSizes); Review Comment: I see - I was thinking in doing the same with docs than with bytes, but I see they're quite different. Docs is more a hint to maintain the number of segments we want, but bytes provides a hard limit on size and also score guidance on segments to merge. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[I] Reproducible failure TestSimpleTextKnnVectorsFormat.testFloatVectorScorerIteration [lucene]
ChrisHegarty opened a new issue, #13443: URL: https://github.com/apache/lucene/issues/13443 ``` reproduce with: gradlew test --tests TestSimpleTextKnnVectorsFormat.testFloatVectorScorerIteration -Dtests.seed=D5C839ED84ABC868 -Dtests.locale=kkj-Latn-CM -Dtests.timezone=Europe/Belgrade -Dtests.asserts=true -Dtests.file.encoding=UTF-8 ``` ``` org.apache.lucene.codecs.simpletext.TestSimpleTextKnnVectorsFormat > testFloatVectorScorerIteration FAILED java.lang.AssertionError: expected null, but was: at __randomizedtesting.SeedInfo.seed([D5C839ED84ABC868:4E2CA6F8B4C19413]:0) at org.junit.Assert.fail(Assert.java:89) at org.junit.Assert.failNotNull(Assert.java:756) at org.junit.Assert.assertNull(Assert.java:738) at org.junit.Assert.assertNull(Assert.java:748) at org.apache.lucene.tests.index.BaseKnnVectorsFormatTestCase.testFloatVectorScorerIteration(BaseKnnVectorsFormatTestCase.java:767) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) at java.base/java.lang.reflect.Method.invoke(Method.java:580) ... ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] SimpleText[Float|Byte]VectorValues::scorer should return null when the vector values is empty [lucene]
ChrisHegarty opened a new pull request, #13444: URL: https://github.com/apache/lucene/pull/13444 This commit ensures that `SimpleText[Float|Byte]VectorValues::scorer` returns null when the vector values is empty, as per the scorer javadoc. Other KnnVectorsReader implementations have specialised empty implementations that do similar, e.g. `OffHeapFloatVectorValues.EmptyOffHeapVectorValues`. The `VectorScorer` interface in new in Lucene 9.11, see #13181 An existing test randomly hits this, but a new test has been added that exercises this code path consistently. It's also useful to verify other KnnVectorsReader implementations. closes #13443 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Removed Scorer#getWeight [lucene]
jpountz commented on PR #13440: URL: https://github.com/apache/lucene/pull/13440#issuecomment-2141925741 > We are also removing weight from Subclasses of Scorer as well, right? Yes, that would be great. If there are a few sub classes that really need a Weight, we could keep it, it looks like you identitfied a few of them already. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Removed Scorer#getWeight [lucene]
jpountz commented on PR #13440: URL: https://github.com/apache/lucene/pull/13440#issuecomment-2141965592 > few ctors have been reduced to default ctor. Should I removed those as well? For those that are in the `java` folder, I'd keep them and add javadocs (I thought the build checked that). For those under the `test` folder, we can remove. > Check: FunctionValues#getRangeScorer Good question. Your change looks good this way, but I know Solr is a heavy user of `FunctionValues`, I'm curious if that would be annoying to them. @gerlowskija @dsmiley @HoustonPutman Do you have an opinion on this? > SpatialQuery has also one method where we could possibly change the API to remove the weight parameter as it is not required now. Should I change that as well? Yes. It looks like it was only there to be able to pass the Weight to a Scorer constructor. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Instrument IndexOrDocValuesQuery to report on its decisions [lucene]
jpountz commented on issue #13442: URL: https://github.com/apache/lucene/issues/13442#issuecomment-2141990962 Thinking out loud: I'd like queries to remain as close to value classes as possible, just describing an information need. I believe that the change you're suggesting would require storing a listener on `IndexOrDocValuesQuery`, which would go against this. Maybe we should introduce a more general framework to allow queries and collectors to report about some interesting decisions they make, and keep the state on e.g. `IndexSearcher` rather than `Query`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] SimpleText[Float|Byte]VectorValues::scorer should return null when the vector values is empty [lucene]
ChrisHegarty commented on PR #13444: URL: https://github.com/apache/lucene/pull/13444#issuecomment-2142165781 This commit ensures that SimpleText[Float|Byte]VectorValues::scorer returns null when the vector values is empty, as per the scorer javadoc. Other KnnVectorsReader implementations have specialised empty implementations that do similar, e.g. OffHeapFloatVectorValues.EmptyOffHeapVectorValues. The VectorScorer interface in new in Lucene 9.11, see https://github.com/apache/lucene/pull/13181 An existing test randomly hits this, but a new test has been added that exercises this code path consistently. It's also useful to verify other KnnVectorsReader implementations. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Reproducible failure TestSimpleTextKnnVectorsFormat.testFloatVectorScorerIteration [lucene]
ChrisHegarty closed issue #13443: Reproducible failure TestSimpleTextKnnVectorsFormat.testFloatVectorScorerIteration URL: https://github.com/apache/lucene/issues/13443 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] SimpleText[Float|Byte]VectorValues::scorer should return null when the vector values is empty [lucene]
ChrisHegarty merged PR #13444: URL: https://github.com/apache/lucene/pull/13444 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add new test case "testGetLines" for lucene/core/analysis/WordlistLoader [lucene]
hack4chang commented on code in PR #13419: URL: https://github.com/apache/lucene/pull/13419#discussion_r1621365979 ## lucene/core/src/test/org/apache/lucene/analysis/TestWordlistLoader.java: ## @@ -77,4 +82,17 @@ public void testSnowballListLoading() throws IOException { assertTrue(wordset.contains("six")); assertTrue(wordset.contains("seven")); } + + public void testGetLines() throws IOException { +String s = "One \n#Comment \n \n Two \n Three \n"; +Charset charset = StandardCharsets.UTF_8; +byte[] sByteArr = s.getBytes(charset); +InputStream sInputStream = new ByteArrayInputStream(sByteArr); +List lines = WordlistLoader.getLines(sInputStream, charset); +assertEquals(3, lines.size()); +assertFalse(lines.contains("#Comment")); Review Comment: Hi, yeah sounds reasonable, I will make a change. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Add new dynamic confidence interval configuration to scalar quantized format [lucene]
benwtrent opened a new pull request, #13445: URL: https://github.com/apache/lucene/pull/13445 When int4 scalar quantization was merged, it added a new way to dynamically calculate quantiles. However, when that was merged, I inadvertently changed the default behavior, where a `null` confidenceInterval would actually calculate the dynamic quantiles instead of doing the previous auto-setting to `1 - 1/(dim + 1)`. This commit formalizes the dynamic quantile calculate through setting the confidenceInterval to `0`, and preserves the previous behavior for `null` confidenceIntervals so that users upgrading will not see different quantiles than they would expect. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Removed Scorer#getWeight [lucene]
dsmiley commented on PR #13440: URL: https://github.com/apache/lucene/pull/13440#issuecomment-2142590119 It's not clear why Solr would care with regards to FunctionValues & Weights in particular. I don't notice Solr using Weights there but maybe I'm not looking in quite the right spot? if Scorer.getWeight is being removed, I only see a couple places in Solr where this is used, both in LTR, interestingly. This shouldn't stop Lucene from doing what it ought to do. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add new dynamic confidence interval configuration to scalar quantized format [lucene]
benwtrent commented on PR #13445: URL: https://github.com/apache/lucene/pull/13445#issuecomment-2142591533 I am planning on putting this in 9.11 as it is also a fix for breaking BWC that 9.11 currently does when using the same format configuration between 9.10 vs. 9.11. In 9.10, supplying a `null` confidence interval indicated a static interval of `1 - 1/(dim + 1)`, without this commit, we break this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org