Re: [PR] Removed Scorer#getWeight [lucene]

2024-05-31 Thread via GitHub


iamsanjay commented on PR #13440:
URL: https://github.com/apache/lucene/pull/13440#issuecomment-2141366659

   1. There were three classes that are using weight in some way so I leave 
them to be for now.
   
   `FunctionQuery#AllScorer`
   
https://github.com/apache/lucene/blob/750a7c4d3b3e174023404bf363861dae31413901/lucene/queries/src/java/org/apache/lucene/queries/function/FunctionQuery.java#L117
   
   `TermAutomatonScorer`
   
https://github.com/apache/lucene/blob/750a7c4d3b3e174023404bf363861dae31413901/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/TermAutomatonScorer.java#L68
   
   `TestMinShouldMatch2#SlowMinShouldMatchScorer`
   
https://github.com/apache/lucene/blob/750a7c4d3b3e174023404bf363861dae31413901/lucene/core/src/test/org/apache/lucene/search/TestMinShouldMatch2.java#L357
   
   2. As I removed weight, few ctors -- `JustCompileScorer`, `SimpleScorer`, 
`TestScoreCachingWrappingScorer#SimpleScorer`, `Scorer` -- have been reduced to 
default ctor. Should I removed those as well? 
   
   3. As we removing Weight, some APIs that were calling with weight is not 
required. Hence I removed the weight from their method signature as well -- 
Check: FunctionValues#getRangeScorer and all the respective overrides. 
   4. `SpatialQuery` has also one method where we could possibly change the API 
to remove the weight parameter as it is not required now. Should I change that 
as well? There is a possibility we may found some more APIs where the weight 
can be removed now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] WIP - Add minimum number of segments to TieredMergePolicy [lucene]

2024-05-31 Thread via GitHub


carlosdelest commented on code in PR #13430:
URL: https://github.com/apache/lucene/pull/13430#discussion_r1620451379


##
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##
@@ -93,6 +93,7 @@ public class TieredMergePolicy extends MergePolicy {
   private double segsPerTier = 10.0;
   private double forceMergeDeletesPctAllowed = 10.0;
   private double deletesPctAllowed = 20.0;
+  private int targetSearchConcurrency = -1;

Review Comment:
   Yes - I iterated a bit on this but now it's equivalent. Thanks!



##
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##
@@ -522,21 +550,28 @@ private MergeSpecification doFindMerges(
 final List candidate = new ArrayList<>();

Review Comment:
   I see some test failures, failing to merge segments that are eligible both 
from a bytes and doc perspective.
   
   It seems related to [this line 
here](https://github.com/apache/lucene/pull/13430/files#diff-da7b621934abf1e7bc74d6bf15f17ddd95cffa4f374776125d8de5cdc0054b2dL507):
   
   ```java
 if (mergeType == MERGE_TYPE.NATURAL
 && sortedEligible.size() <= allowedSegCount
 && remainingDelCount <= allowedDelCount) {
   return spec;
 } 
   ```
   
   Here we're bailing out if we don't have enough eligible segments as the 
allowed seg count.
   
   When adding more allowed segment counts with the target search concurrency, 
this optimization makes it impossible to merge segments that are available for 
merging, and thus get over budget and fail the tests.
   
   I don't get this optimization - why shouldn't we try to merge if we have 
less sorted eligibles? What am I missing to change this so we can take into 
account the bigger number of `allowedSegCount` that comes with adding target 
search concurrency?
   
   



##
lucene/core/src/test/org/apache/lucene/index/TestTieredMergePolicy.java:
##
@@ -431,6 +447,128 @@ public void testForcedMergesRespectSegSize() throws 
Exception {
 dir.close();
   }
 
+  public void testForcedMergesRespectsTargetSearchConcurrency() throws 
Exception {

Review Comment:
   I've given it a try - LMKWYT!



##
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##
@@ -409,6 +433,8 @@ public MergeSpecification findMerges(
 // allowedSegCount may occasionally be less than segsPerTier
 // if segment sizes are below the floor size
 allowedSegCount = Math.max(allowedSegCount, segsPerTier);
+// Override with the targetSearchConcurrency if it is greater
+allowedSegCount = Math.max(allowedSegCount, targetSearchConcurrency);

Review Comment:
   Correct, now the highest tier needs to take that into account. Thanks! 👍 



##
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##
@@ -257,6 +258,25 @@ public double getSegmentsPerTier() {
 return segsPerTier;
   }
 
+  /**
+   * Sets the target search concurrency. This allows merging to ensure that 
there are at least
+   * targetSearchConcurrency segments on the top tier. This setting can be 
overriden by force

Review Comment:
   Makes sense, updating.



##
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##
@@ -581,11 +621,14 @@ private MergeSpecification doFindMerges(
 // whose length is less than the merge factor, it means we are reaching
 // the tail of the list of segments and will only find smaller merges.
 // Stop here.
-if (bestScore != null && hitTooLarge == false && candidate.size() < 
mergeFactor) {
+if (bestScore != null
+&& hitTooLarge == false
+&& hitMaxDocs == false
+&& candidate.size() < mergeFactor) {
   break;
 }
 
-final MergeScore score = score(candidate, hitTooLarge, segInfosSizes);
+final MergeScore score = score(candidate, hitTooLarge || hitMaxDocs, 
segInfosSizes);

Review Comment:
   I see - I was thinking in doing the same with docs than with bytes, but I 
see they're quite different. Docs is more a hint to maintain the number of 
segments we want, but bytes provides a hard limit on size and also score 
guidance on segments to merge.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] Reproducible failure TestSimpleTextKnnVectorsFormat.testFloatVectorScorerIteration [lucene]

2024-05-31 Thread via GitHub


ChrisHegarty opened a new issue, #13443:
URL: https://github.com/apache/lucene/issues/13443

   
   ```
   reproduce with: gradlew test --tests 
TestSimpleTextKnnVectorsFormat.testFloatVectorScorerIteration 
-Dtests.seed=D5C839ED84ABC868 -Dtests.locale=kkj-Latn-CM 
-Dtests.timezone=Europe/Belgrade -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
   ```
   
   ```
   org.apache.lucene.codecs.simpletext.TestSimpleTextKnnVectorsFormat > 
testFloatVectorScorerIteration FAILED
   java.lang.AssertionError: expected null, but 
was:
   at 
__randomizedtesting.SeedInfo.seed([D5C839ED84ABC868:4E2CA6F8B4C19413]:0)
   at org.junit.Assert.fail(Assert.java:89)
   at org.junit.Assert.failNotNull(Assert.java:756)
   at org.junit.Assert.assertNull(Assert.java:738)
   at org.junit.Assert.assertNull(Assert.java:748)
   at 
org.apache.lucene.tests.index.BaseKnnVectorsFormatTestCase.testFloatVectorScorerIteration(BaseKnnVectorsFormatTestCase.java:767)
   at 
java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
   at java.base/java.lang.reflect.Method.invoke(Method.java:580)
   ...
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] SimpleText[Float|Byte]VectorValues::scorer should return null when the vector values is empty [lucene]

2024-05-31 Thread via GitHub


ChrisHegarty opened a new pull request, #13444:
URL: https://github.com/apache/lucene/pull/13444

   This commit ensures that `SimpleText[Float|Byte]VectorValues::scorer` 
returns null when the vector values is empty, as per the scorer javadoc.  Other 
KnnVectorsReader implementations have specialised empty implementations that do 
similar, e.g. `OffHeapFloatVectorValues.EmptyOffHeapVectorValues`. The 
`VectorScorer` interface in new in Lucene 9.11, see #13181 
   
   An existing test randomly hits this, but a new test has been added that 
exercises this code path consistently. It's also useful to verify other 
KnnVectorsReader implementations.
   
   closes #13443


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Removed Scorer#getWeight [lucene]

2024-05-31 Thread via GitHub


jpountz commented on PR #13440:
URL: https://github.com/apache/lucene/pull/13440#issuecomment-2141925741

   > We are also removing weight from Subclasses of Scorer as well, right?
   
   Yes, that would be great. If there are a few sub classes that really need a 
Weight, we could keep it, it looks like you identitfied a few of them already.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Removed Scorer#getWeight [lucene]

2024-05-31 Thread via GitHub


jpountz commented on PR #13440:
URL: https://github.com/apache/lucene/pull/13440#issuecomment-2141965592

   > few ctors have been reduced to default ctor. Should I removed those as 
well?
   
   For those that are in the `java` folder, I'd keep them and add javadocs (I 
thought the build checked that). For those under the `test` folder, we can 
remove.
   
   > Check: FunctionValues#getRangeScorer
   
   Good question. Your change looks good this way, but I know Solr is a heavy 
user of `FunctionValues`, I'm curious if that would be annoying to them. 
@gerlowskija @dsmiley @HoustonPutman Do you have an opinion on this?
   
   > SpatialQuery has also one method where we could possibly change the API to 
remove the weight parameter as it is not required now. Should I change that as 
well?
   
   Yes. It looks like it was only there to be able to pass the Weight to a 
Scorer constructor.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Instrument IndexOrDocValuesQuery to report on its decisions [lucene]

2024-05-31 Thread via GitHub


jpountz commented on issue #13442:
URL: https://github.com/apache/lucene/issues/13442#issuecomment-2141990962

   Thinking out loud: I'd like queries to remain as close to value classes as 
possible, just describing an information need. I believe that the change you're 
suggesting would require storing a listener on `IndexOrDocValuesQuery`, which 
would go against this. Maybe we should introduce a more general framework to 
allow queries and collectors to report about some interesting decisions they 
make, and keep the state on e.g. `IndexSearcher` rather than `Query`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] SimpleText[Float|Byte]VectorValues::scorer should return null when the vector values is empty [lucene]

2024-05-31 Thread via GitHub


ChrisHegarty commented on PR #13444:
URL: https://github.com/apache/lucene/pull/13444#issuecomment-2142165781

   This commit ensures that SimpleText[Float|Byte]VectorValues::scorer returns 
null when the vector values is empty, as per the scorer javadoc. Other 
KnnVectorsReader implementations have specialised empty implementations that do 
similar, e.g. OffHeapFloatVectorValues.EmptyOffHeapVectorValues. The 
VectorScorer interface in new in Lucene 9.11, see 
https://github.com/apache/lucene/pull/13181
   
   An existing test randomly hits this, but a new test has been added that 
exercises this code path consistently. It's also useful to verify other 
KnnVectorsReader implementations.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Reproducible failure TestSimpleTextKnnVectorsFormat.testFloatVectorScorerIteration [lucene]

2024-05-31 Thread via GitHub


ChrisHegarty closed issue #13443: Reproducible failure 
TestSimpleTextKnnVectorsFormat.testFloatVectorScorerIteration
URL: https://github.com/apache/lucene/issues/13443


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] SimpleText[Float|Byte]VectorValues::scorer should return null when the vector values is empty [lucene]

2024-05-31 Thread via GitHub


ChrisHegarty merged PR #13444:
URL: https://github.com/apache/lucene/pull/13444


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new test case "testGetLines" for lucene/core/analysis/WordlistLoader [lucene]

2024-05-31 Thread via GitHub


hack4chang commented on code in PR #13419:
URL: https://github.com/apache/lucene/pull/13419#discussion_r1621365979


##
lucene/core/src/test/org/apache/lucene/analysis/TestWordlistLoader.java:
##
@@ -77,4 +82,17 @@ public void testSnowballListLoading() throws IOException {
 assertTrue(wordset.contains("six"));
 assertTrue(wordset.contains("seven"));
   }
+
+  public void testGetLines() throws IOException {
+String s = "One \n#Comment \n \n Two \n  Three  \n";
+Charset charset = StandardCharsets.UTF_8;
+byte[] sByteArr = s.getBytes(charset);
+InputStream sInputStream = new ByteArrayInputStream(sByteArr);
+List lines = WordlistLoader.getLines(sInputStream, charset);
+assertEquals(3, lines.size());
+assertFalse(lines.contains("#Comment"));

Review Comment:
   Hi, yeah sounds reasonable, I will make a change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Add new dynamic confidence interval configuration to scalar quantized format [lucene]

2024-05-31 Thread via GitHub


benwtrent opened a new pull request, #13445:
URL: https://github.com/apache/lucene/pull/13445

   When int4 scalar quantization was merged, it added a new way to dynamically 
calculate quantiles. 
   
   However, when that was merged, I inadvertently changed the default behavior, 
where a `null` confidenceInterval would actually calculate the dynamic 
quantiles instead of doing the previous auto-setting to `1 - 1/(dim + 1)`.
   
   This commit formalizes the dynamic quantile calculate through setting the 
confidenceInterval to `0`, and preserves the previous behavior for `null` 
confidenceIntervals so that users upgrading will not see different quantiles 
than they would expect. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Removed Scorer#getWeight [lucene]

2024-05-31 Thread via GitHub


dsmiley commented on PR #13440:
URL: https://github.com/apache/lucene/pull/13440#issuecomment-2142590119

   It's not clear why Solr would care with regards to FunctionValues & Weights 
in particular.  I don't notice Solr using Weights there but maybe I'm not 
looking in quite the right spot?
   
   if Scorer.getWeight is being removed, I only see a couple places in Solr 
where this is used, both in LTR, interestingly.  This shouldn't stop Lucene 
from doing what it ought to do.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new dynamic confidence interval configuration to scalar quantized format [lucene]

2024-05-31 Thread via GitHub


benwtrent commented on PR #13445:
URL: https://github.com/apache/lucene/pull/13445#issuecomment-2142591533

   I am planning on putting this in 9.11 as it is also a fix for breaking BWC 
that 9.11 currently does when using the same format configuration between 9.10 
vs. 9.11. In 9.10, supplying a `null` confidence interval indicated a static 
interval of `1 - 1/(dim + 1)`, without this commit, we break this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org