Re: [PR] Add Bulk Scorer For ToParentBlockJoinQuery [lucene]

2024-09-16 Thread via GitHub


javanna commented on PR #13697:
URL: https://github.com/apache/lucene/pull/13697#issuecomment-2352173213

   There's a couple of recent test failures, in main as well as 9x, that may 
have to do with this change, judging from the area that it touches:
   
   ```
   FAILED:  
org.apache.lucene.search.join.TestBlockJoinBulkScorer.testSetMinCompetitiveScoreWithScoreModeMax
   
   Error Message:
   java.lang.AssertionError: expected:<{16=5.0, 10=10.0, 2=6.0}> but 
was:<{2=6.0, 10=10.0}>
   
   Stack Trace:
   java.lang.AssertionError: expected:<{16=5.0, 10=10.0, 2=6.0}> but 
was:<{2=6.0, 10=10.0}>
   at 
__randomizedtesting.SeedInfo.seed([8531AA8A82E0A86C:D0DE9F82717ED75C]:0)
   at app/junit@4.13.1/org.junit.Assert.fail(Assert.java:89)
   at app/junit@4.13.1/org.junit.Assert.failNotEquals(Assert.java:835)
   at app/junit@4.13.1/org.junit.Assert.assertEquals(Assert.java:120)
   at app/junit@4.13.1/org.junit.Assert.assertEquals(Assert.java:146)
   at 
app//org.apache.lucene.search.join.TestBlockJoinBulkScorer.assertScores(TestBlockJoinBulkScorer.java:301)
   at 
app//org.apache.lucene.search.join.TestBlockJoinBulkScorer.testSetMinCompetitiveScoreWithScoreModeMax(TestBlockJoinBulkScorer.java:397)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add Bulk Scorer For ToParentBlockJoinQuery [lucene]

2024-09-16 Thread via GitHub


jpountz commented on PR #13697:
URL: https://github.com/apache/lucene/pull/13697#issuecomment-2352178493

   FYI @Mikep86 opened a PR at https://github.com/apache/lucene/pull/13785.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix Flaky Test In TestBlockJoinBulkScorer [lucene]

2024-09-16 Thread via GitHub


jpountz commented on code in PR #13785:
URL: https://github.com/apache/lucene/pull/13785#discussion_r1760649790


##
lucene/join/src/test/org/apache/lucene/search/join/TestBlockJoinBulkScorer.java:
##
@@ -283,7 +292,33 @@ public void collect(int doc) throws IOException {
 null,
 0,
 NO_MORE_DOCS);
-assertEquals(expectedScores, actualScores);
+
+if (expectedScoresList.size() == 1) {
+  assertEquals(expectedScoresList.getFirst(), actualScores);
+} else {
+  assertEqualsToOneOf(expectedScoresList, actualScores);
+}
+  }
+
+  private static void assertEqualsToOneOf(List expectedList, Object actual) 
{
+boolean foundMatch = false;
+for (Object expected : expectedList) {
+  if (expected == null) {
+if (actual == null) {
+  foundMatch = true;
+  break;
+}
+  } else {
+foundMatch = expected.equals(actual);
+if (foundMatch) {
+  break;
+}
+  }

Review Comment:
   nit: maybe merge these two conditions under a single `if 
(Objects.equals(expected, actual))` which would take care of the `null` case?



##
lucene/join/src/test/org/apache/lucene/search/join/TestBlockJoinBulkScorer.java:
##
@@ -283,7 +292,33 @@ public void collect(int doc) throws IOException {
 null,
 0,
 NO_MORE_DOCS);
-assertEquals(expectedScores, actualScores);
+
+if (expectedScoresList.size() == 1) {
+  assertEquals(expectedScoresList.getFirst(), actualScores);
+} else {
+  assertEqualsToOneOf(expectedScoresList, actualScores);
+}

Review Comment:
   I'm curious why you don't call `assertEqualsToOneOf` all the time, shouldn't 
it also work in the size==1 case?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up advancing within a block. [lucene]

2024-09-16 Thread via GitHub


jpountz commented on PR #13692:
URL: https://github.com/apache/lucene/pull/13692#issuecomment-2352315844

   Woops, sorry I'm only seeing your reply now. The above analysis you referred 
to uses branchless binary search over the full buffer of 128 doc IDs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up advancing within a block. [lucene]

2024-09-16 Thread via GitHub


jpountz commented on PR #13692:
URL: https://github.com/apache/lucene/pull/13692#issuecomment-2352330831

   I reverted this change. While it was a good win on average on my machine, it 
was almost a net loss on nightly benchmarks. So I'd rather keep the current 
linear scan approach, which has a lower overhead when creating new 
`PostingsEnum` objects and is simpler.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Deprecate BulkScorer#score(LeafReaderContext, Bits) [lucene]

2024-09-16 Thread via GitHub


javanna opened a new pull request, #13794:
URL: https://github.com/apache/lucene/pull/13794

   We have removed BulkScorer#score(LeafReaderContext, Bits) in main in favour 
of BulkScorer#score(LeafCollector collector, Bits acceptDocs, int min, int max) 
as part of #13542. This commit deprecates the method in 9.x. Internal usages of 
it are left untouched as there may be subclasses that override it, which we 
don't want to break in a minor release.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add support for intra-segment search concurrency [lucene]

2024-09-16 Thread via GitHub


javanna commented on code in PR #13542:
URL: https://github.com/apache/lucene/pull/13542#discussion_r1760803091


##
lucene/core/src/java/org/apache/lucene/search/BulkScorer.java:
##
@@ -27,18 +27,6 @@
  */
 public abstract class BulkScorer {
 
-  /**
-   * Scores and collects all matching documents.
-   *
-   * @param collector The collector to which all matching documents are passed.
-   * @param acceptDocs {@link Bits} that represents the allowed documents to 
match, or {@code null}
-   * if they are all allowed to match.
-   */
-  public void score(LeafCollector collector, Bits acceptDocs) throws 
IOException {

Review Comment:
   I opened this after all: https://github.com/apache/lucene/pull/13794 . It 
deprecates the method but leaves its usages untouched to avoid breaking users 
that override it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Deprecate BulkScorer#score(LeafReaderContext, Bits) [lucene]

2024-09-16 Thread via GitHub


jpountz commented on code in PR #13794:
URL: https://github.com/apache/lucene/pull/13794#discussion_r1760821555


##
lucene/core/src/java/org/apache/lucene/search/BulkScorer.java:
##
@@ -33,7 +33,11 @@ public abstract class BulkScorer {
* @param collector The collector to which all matching documents are passed.
* @param acceptDocs {@link Bits} that represents the allowed documents to 
match, or {@code null}
* if they are all allowed to match.
+   * @deprecated in favour of {@link #score(LeafCollector, Bits, int, int)}. 
Callers should instead
+   * call the method variant that takes min and max as arguments. 
Subclasses that override it

Review Comment:
   maybe also suggest passing min=0 and max=DocIdSetIterator.NO_MORE_DOCS here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Deprecate BulkScorer#score(LeafReaderContext, Bits) [lucene]

2024-09-16 Thread via GitHub


javanna commented on code in PR #13794:
URL: https://github.com/apache/lucene/pull/13794#discussion_r1760852834


##
lucene/core/src/java/org/apache/lucene/search/BulkScorer.java:
##
@@ -33,7 +33,11 @@ public abstract class BulkScorer {
* @param collector The collector to which all matching documents are passed.
* @param acceptDocs {@link Bits} that represents the allowed documents to 
match, or {@code null}
* if they are all allowed to match.
+   * @deprecated in favour of {@link #score(LeafCollector, Bits, int, int)}. 
Callers should instead
+   * call the method variant that takes min and max as arguments. 
Subclasses that override it

Review Comment:
   good idea



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we auto-adjust top score doc and top field collector manager based on slices? [lucene]

2024-09-16 Thread via GitHub


javanna commented on issue #13791:
URL: https://github.com/apache/lucene/issues/13791#issuecomment-2352478375

   ++ to that @jpountz that would also be my preferred approach. I somehow made 
the assumption that we are not willing to go that route, but I think we should 
revisit that decision, assuming the slowdown is small. It feels wrong for sure 
that users have to know about internal execution when creating a collector 
manager. Ideally, providing slices to adapt the behaviour would not be 
necessary either.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Deprecate BulkScorer#score(LeafReaderContext, Bits) [lucene]

2024-09-16 Thread via GitHub


javanna merged PR #13794:
URL: https://github.com/apache/lucene/pull/13794


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Deprecate BulkScorer#score(LeafReaderContext, Bits) [lucene]

2024-09-16 Thread via GitHub


javanna commented on PR #13794:
URL: https://github.com/apache/lucene/pull/13794#issuecomment-2352510100

   Thanks @jpountz !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove CollectorManager#forSequentialExecution [lucene]

2024-09-16 Thread via GitHub


javanna commented on PR #13790:
URL: https://github.com/apache/lucene/pull/13790#issuecomment-2352511247

   @gsmiller how do you feel about this one? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add support for intra-segment search concurrency [lucene]

2024-09-16 Thread via GitHub


javanna commented on code in PR #13542:
URL: https://github.com/apache/lucene/pull/13542#discussion_r1760911695


##
lucene/facet/src/java/org/apache/lucene/facet/FacetsCollector.java:
##
@@ -97,12 +97,12 @@ public List getMatchingDocs() {
   public void collect(int doc) throws IOException {
 docsBuilder.grow(1).add(doc);
 if (keepScores) {
-  if (totalHits >= scores.length) {
-float[] newScores = new float[ArrayUtil.oversize(totalHits + 1, 4)];
-System.arraycopy(scores, 0, newScores, 0, totalHits);
+  if (doc >= scores.length) {
+float[] newScores = new float[ArrayUtil.oversize(doc + 1, 4)];
+System.arraycopy(scores, 0, newScores, 0, doc);
 scores = newScores;
   }
-  scores[totalHits] = scorer.score();
+  scores[doc] = scorer.score();

Review Comment:
   thanks for looking! I noticed that we document the scores array as 
non-sparse and this change makes it indeed sparse. I think I could have been 
smarter here and keep it non-sparse. I will try to address this. It feels bad 
that we allocate such a big array and keep on resizing it as well. We should 
rather then build in advance knowing how many docs the leaf has. I will see 
what I can do about this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix Flaky Test In TestBlockJoinBulkScorer [lucene]

2024-09-16 Thread via GitHub


Mikep86 commented on code in PR #13785:
URL: https://github.com/apache/lucene/pull/13785#discussion_r1761059326


##
lucene/join/src/test/org/apache/lucene/search/join/TestBlockJoinBulkScorer.java:
##
@@ -283,7 +292,33 @@ public void collect(int doc) throws IOException {
 null,
 0,
 NO_MORE_DOCS);
-assertEquals(expectedScores, actualScores);
+
+if (expectedScoresList.size() == 1) {
+  assertEquals(expectedScoresList.getFirst(), actualScores);
+} else {
+  assertEqualsToOneOf(expectedScoresList, actualScores);
+}

Review Comment:
   It should and I originally did that, but if you use `assertEquals` there are 
nicer IDE integrations for value comparison on failure. Given that this method 
is called with a single map of expected scores in nearly all cases, I tried to 
preserve that functionality. This could help in the case of a failure in 
`testScoreRandomIndices`, which could have a large random score map.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix Flaky Test In TestBlockJoinBulkScorer [lucene]

2024-09-16 Thread via GitHub


jpountz commented on code in PR #13785:
URL: https://github.com/apache/lucene/pull/13785#discussion_r1761068214


##
lucene/join/src/test/org/apache/lucene/search/join/TestBlockJoinBulkScorer.java:
##
@@ -283,7 +292,33 @@ public void collect(int doc) throws IOException {
 null,
 0,
 NO_MORE_DOCS);
-assertEquals(expectedScores, actualScores);
+
+if (expectedScoresList.size() == 1) {
+  assertEquals(expectedScoresList.getFirst(), actualScores);
+} else {
+  assertEqualsToOneOf(expectedScoresList, actualScores);
+}

Review Comment:
   OK



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix Flaky Test In TestBlockJoinBulkScorer [lucene]

2024-09-16 Thread via GitHub


jpountz merged PR #13785:
URL: https://github.com/apache/lucene/pull/13785


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Make Operations#optional create simpler automata. [lucene]

2024-09-16 Thread via GitHub


jpountz merged PR #13793:
URL: https://github.com/apache/lucene/pull/13793


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] "cz" (vs ISO langauge code "cs") for Czech analysis package? [LUCENE-6366] [lucene]

2024-09-16 Thread via GitHub


WEBCON-BPS-DEV commented on issue #7426:
URL: https://github.com/apache/lucene/issues/7426#issuecomment-2352942882

   Is there any update on the issue?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] First-class random access API for KnnVectorValues [lucene]

2024-09-16 Thread via GitHub


msokolov commented on PR #13779:
URL: https://github.com/apache/lucene/pull/13779#issuecomment-2353036120

   I pushed a new revision here addressing some of the major comments:
   
   1. `KnnVectorValues.iterator()` now generally provides a new iterator; no 
caching is done. I removed `createIterator()`. Main impact was on VectorScorer 
(and in tests) where we now create iterators and store them locally. This is 
much better; thanks for the feedback.
   2. I added implementations for `advance()` and got rid of the default impl.
   3. I *removed* impls of `cost()` and added a default impl that throws UOE. 
This method is only ever used during search() and most of these values sources 
will never be searched. The exceptions are those that can be used by the 
ValueSource API: basically the indexed values returned by a reader. We have 
lots and lots of other values impls that are used during indexing for which we 
don't need cost. I briefly considered separating these new iterators from DISI, 
but that ended up in some trouble.
   4. re: `getVectorByteLength()` @ChrisHegarty is correct that this is needed 
as it is today. We could in theory make it final (or inline it whatever) if we 
added some more VectorEncodings to represent the compressed cases, but I'm 
inclined to leave it as is. This way we could in theory support a variable size 
encoding? And anyway it isn't clear we want to mix up the "encoding" with 
compression?
   
   I didn't have a chance to look seriously at removing `copy()` API. I don't 
think we ought to create a simple wrapper though since afaict it would require 
an additional memory copy of every vector value.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] First-class random access API for KnnVectorValues [lucene]

2024-09-16 Thread via GitHub


msokolov commented on PR #13779:
URL: https://github.com/apache/lucene/pull/13779#issuecomment-2353098829

   OK there seem to be some test failures ... I did a complete run, but 
randomized testing always seems to ferret out something interesting!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Extended spell checker with phrase support and adaptive user session analysis. [LUCENE-626] [lucene]

2024-09-16 Thread via GitHub


Menahali commented on issue #1701:
URL: https://github.com/apache/lucene/issues/1701#issuecomment-2353104696

   > Karl Wettin ([migrated from 
JIRA](https://issues.apache.org/jira/browse/LUCENE-626?focusedCommentId=12477688&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12477688))
   > 
   > Nicolas Lalevée [03/Mar/07 01:04 PM]
   > > This feature looks interesting, but why should it depend on #1628 ? 
   > 
   > It use the Index (notification, unison index factory methods, et c.) and 
IndexFacade (cache, fresh reader/searcher et c.) available in that patch. And 
by doing that, it also enables me to use InstantiatedIndex for the a priori 
corpus and ngram index to speed up the response time even more.
   > 
   > 
   
   I can't mack any thing in my program the baker closed every thing in my 
githup program


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] SpanOrQuery uses IDFs of failed subqueries in score calculation. [lucene]

2024-09-16 Thread via GitHub


tkarampAlpha opened a new issue, #13796:
URL: https://github.com/apache/lucene/issues/13796

   ### Description
   
   It seems that for SpanOrQuery IDF of terms belonging in subqueries that will 
not match a given document, will affect said document's score.
   
   I have observed this through on which I have 3 documents:
   
   ```
   doc1: 
   field: something
   doc2:
   field: nothing
   doc3: 
   field: anything
   ```
   
   And I issue the following query:
   
   ```spanOr([Contents:something, Contents:nothing])```
   
   If you check at the score explanation you will notice that in both 
document's score the idf of both terms affects it even though for each document 
only one matches.
   
   This is an example of the explanation of the first document's score:
   ```
   3.9616547 = weight(spanOr([Contents:something, Contents:nothing]) in 0) 
[AsBM25Similarity], result of:
 3.9616547 = score(freq=1.0), computed as boost * idf * tf from:
   51.0 = boost
   3.9616585 = idf, sum of:
 1.9808292 = idf for term nothing , computed as log(1 + (docCount - 
docFreq + 0.5) / (docFreq + 0.5)) + 1 from:
   1 = docFreq
   3 = docCount
 1.9808292 = idf for term something , computed as log(1 + (docCount - 
docFreq + 0.5) / (docFreq + 0.5)) + 1 from:
   1 = docFreq
   3 = docCount
   0.019607842 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / 
avgdl)) from:
 1.0 = phraseFreq=1.0
 50.0 = k1, term saturation parameter
 0.0 = b, length normalization parameter
 1.0 = dl, length of field
 2.0 = avgdl, average length of field
   ```
   
   
   ### Version and environment details
   
   lucene 9.7.0 through solr 9.3.0


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] First-class random access API for KnnVectorValues [lucene]

2024-09-16 Thread via GitHub


msokolov commented on PR #13779:
URL: https://github.com/apache/lucene/pull/13779#issuecomment-2353215032

   Regarding the rename of `fromOrdToDoc` to `all` I think it was not helpful 
and plan to revert or maybe come up with some other name. The problem is we 
also have `createDenseIterator` which is also `all`. Essentially we have Sparse 
and Dense all-iterators. Maybe instead of fromOrdToDoc we can say 
`createSparseIterator`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] First-class random access API for KnnVectorValues [lucene]

2024-09-16 Thread via GitHub


jpountz commented on PR #13779:
URL: https://github.com/apache/lucene/pull/13779#issuecomment-2353373827

   FWIW I started playing with removing copy() by replacing it with a factory 
method for a dictionary: 
https://github.com/msokolov/lucene/commit/ae7aca32a690a4b21a3da793258ce17560b551e7.
 Not sure how far I'll go. :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] First-class random access API for KnnVectorValues [lucene]

2024-09-16 Thread via GitHub


msokolov commented on PR #13779:
URL: https://github.com/apache/lucene/pull/13779#issuecomment-2353397543

   I also just started trying to replace `copy()` with the approach of adding 
`vectorValue(int ord, float[] outValue)` although this does add a copy 
operation in some cases where previously we would expose internal storage so 
I'm not sure it's great


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up advancing within a block. [lucene]

2024-09-16 Thread via GitHub


mikemccand commented on PR #13692:
URL: https://github.com/apache/lucene/pull/13692#issuecomment-2353416849

   > I plotted the number of docs that queries need to skip within a block when 
advancing
   
   This is a really cool chart!
   
   Maybe we could somehow dynamically optimize, picking the right search 
(linear, vs (branchless) binary) depending on the stats/behavior of the actual 
blocks we've encountered so far?  It'd be a simple form of query optimization 
... but that's maybe more code complexity than is worth it ...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix Flaky Test In TestBlockJoinBulkScorer [lucene]

2024-09-16 Thread via GitHub


javanna commented on PR #13785:
URL: https://github.com/apache/lucene/pull/13785#issuecomment-2353671468

   Thanks for fixing this!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove CollectorManager#forSequentialExecution [lucene]

2024-09-16 Thread via GitHub


gsmiller commented on PR #13790:
URL: https://github.com/apache/lucene/pull/13790#issuecomment-2354128424

   tl;dr: I agree with removing this. 
   
   I was [initially 
hesitant](https://github.com/apache/lucene/pull/13735#issuecomment-2338340094) 
to add this for a lot of the same reasons, but was convinced when realizing we 
could at least explicitly fail by asserting that more than one collector is 
never created (so at least we wouldn't be silently creating tear-your-hair-out 
concurrency bugs). But looking at the bigger picture, the whole point of 
removing `IndexSearcher#search(Query, Collector)` is to help users avoid traps 
situations where they think they're setup for concurrent search but not 
utilizing it. I agree it's a worse trap in a lot of ways to introduce this 
failure mode. I particularly do not like that this "helper" collector manager 
can't _actually_ check if it's being used with a concurrency-ready 
IndexSearcher. So I can imagine a real failure case where IndexSearchers are 
being created with an Executor but only have a single segment and things appear 
to work—but then break later (e.g., users that are force-mergi
 ng their segments, or users that have sparse test coverage that only includes 
testing over single segments, etc.).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2024-09-16 Thread via GitHub


vigyasharma commented on issue #13387:
URL: https://github.com/apache/lucene/issues/13387#issuecomment-2354135495

   I wonder if we can leverage IndexWriter's `addIndexes(Directory... dirs)` 
[API](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java#L2984)
 for this. We could create separate indexes for every category (log groups 2xx, 
4xx, 5xx in the example here), and combine them into one using this API. 
Internally, this version of the API simply copies over all segment files in the 
directory, so it should be pretty fast.
   
   This could mean that each shard for an OpenSearch/Elasticsearch index would 
maintain internal indexes for each desired category, and use the API to combine 
them into a common "shard" index at every flush? We'd still need a way to 
maintain category labels for a segment during merging, but that's a common 
problem for any approach we take.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] First-class random access API for KnnVectorValues [lucene]

2024-09-16 Thread via GitHub


navneet1v commented on code in PR #13779:
URL: https://github.com/apache/lucene/pull/13779#discussion_r1762496963


##
lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java:
##
@@ -0,0 +1,281 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.document.KnnByteVectorField;
+import org.apache.lucene.document.KnnFloatVectorField;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.Bits;
+
+/**
+ * This class abstracts addressing of document vector values indexed as {@link 
KnnFloatVectorField}
+ * or {@link KnnByteVectorField}.
+ *
+ * @lucene.experimental
+ */
+public abstract class KnnVectorValues {
+
+  /** The iterator associated with these values. */
+  protected DocIndexIterator iterator;
+
+  /** Return the dimension of the vectors */
+  public abstract int dimension();
+
+  /**
+   * Return the number of vectors for this field.
+   *
+   * @return the number of vectors returned by this iterator
+   */
+  public abstract int size();
+
+  /**
+   * Return the docid of the document indexed with the given vector ordinal. 
This default
+   * implementation returns the argument and is appropriate for dense values 
implementations where
+   * every doc has a single value.
+   */
+  public int ordToDoc(int ord) {
+return ord;
+  }
+
+  /**
+   * Creates a new copy of this {@link KnnVectorValues}. This is helpful when 
you need to access
+   * different values at once, to avoid overwriting the underlying vector 
returned.
+   */
+  public abstract KnnVectorValues copy() throws IOException;
+
+  /** Returns the vector byte length, defaults to dimension multiplied by 
float byte size */
+  public int getVectorByteLength() {
+return dimension() * getEncoding().byteSize;
+  }
+
+  /** The vector encoding of these values. */
+  public abstract VectorEncoding getEncoding();
+
+  /** Returns a Bits accepting docs accepted by the argument and having a 
vector value */
+  public Bits getAcceptOrds(Bits acceptDocs) {
+// FIXME: change default to return acceptDocs and provide this impl
+// somewhere more specialized (in every non-dense impl).
+if (acceptDocs == null) {
+  return null;
+}
+return new Bits() {
+  @Override
+  public boolean get(int index) {
+return acceptDocs.get(ordToDoc(index));
+  }
+
+  @Override
+  public int length() {
+return size();
+  }
+};
+  }
+
+  /**
+   * Return the iterator for this instance. If you need multiple iterators, 
call 
+   * this.copy().iterator().
+   */
+  public DocIndexIterator iterator() {
+if (iterator == null) {
+  iterator = createIterator();
+}
+return iterator;
+  }
+
+  /**
+   * Create an iterator for this instance; typically called once by 
iterator(). Wrapper
+   * value classes delegate to their inner instance's iterator and shouldn't 
implement this.
+   */
+  protected DocIndexIterator createIterator() {
+throw new UnsupportedOperationException();
+  }
+
+  /**
+   * A DocIdSetIterator that also provides an index() method tracking a 
distinct ordinal for a
+   * vector associated with each doc.
+   */
+  public abstract static class DocIndexIterator extends DocIdSetIterator {
+
+/** return the value index (aka "ordinal" or "ord") corresponding to the 
current doc */
+public abstract int index();
+
+@Override
+public int advance(int target) throws IOException {
+  return slowAdvance(target);
+}
+
+@Override
+public long cost() {
+  throw new UnsupportedOperationException();

Review Comment:
   +1. Even in FloatVecotorValues cost() is returning size() only. 
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/FloatVectorValues.java#L46-L48



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.