Re: [I] Nightly benchmark regression on 2025.05.01 [lucene]

2025-05-11 Thread via GitHub


rmuir commented on issue #14630:
URL: https://github.com/apache/lucene/issues/14630#issuecomment-2870192359

   > Oh, hmmm, maybe not -- JDK 23 EOL'd.
   
   you can still download it the old fashioned way for a test: 
https://www.oracle.com/java/technologies/javase/jdk23-archive-downloads.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Add comment about using InverseIntersectVisit and IntersectVisitor. [lucene]

2025-05-11 Thread via GitHub


vsop-479 opened a new pull request, #14647:
URL: https://github.com/apache/lucene/pull/14647

   ### Description
   
   
   





Re: [I] Dynamic threshold for DocIdSetBuilder [lucene]

2025-05-11 Thread via GitHub


prudhvigodithi commented on issue #14485:
URL: https://github.com/apache/lucene/issues/14485#issuecomment-2870497304

   So from my understanding, instead of creating one large BitSet for the 
entire segment (sized for maxDoc), the suggestion is to:
   
   - Create a smaller BitSet that only covers the specific range of document 
IDs in a partition.
   - Use the minDocId and maxDocId of the partition to define this range, 
something like the following:
   
   ```
   int rangeSize = maxDocId - minDocId + 1;
   this.threshold = rangeSize >>> 7; 
   ```
   
   So now the memory cost of the BitSet would be based on the range size, not 
maxDoc.
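
To make the difference concrete, here is a minimal, self-contained sketch of the arithmetic. It assumes the bit set is backed by 64-bit words and that doc IDs are shifted by minDocId before being set. `PartitionSizing` and its methods are hypothetical names used for illustration only and are not part of Lucene.

```java
/** Hypothetical helper, not part of Lucene: sizes a bit set and its threshold per partition. */
final class PartitionSizing {
  private PartitionSizing() {}

  /** Number of doc IDs in the inclusive range [minDocId, maxDocId]. */
  static int rangeSize(int minDocId, int maxDocId) {
    if (minDocId < 0 || maxDocId < minDocId) {
      throw new IllegalArgumentException("invalid range: " + minDocId + ".." + maxDocId);
    }
    // addExact guards the corner case [0, Integer.MAX_VALUE].
    return Math.addExact(maxDocId - minDocId, 1);
  }

  /** Same 1/128 ratio as in the snippet above, applied to an arbitrary doc count. */
  static int threshold(int numDocs) {
    return numDocs >>> 7;
  }

  /** Bytes needed for a bit set of numBits bits, assuming 64-bit backing words. */
  static long bitSetBytes(int numBits) {
    long words = ((long) numBits + 63) >>> 6;
    return words * Long.BYTES;
  }

  /** Assumed mapping: a doc ID is stored relative to the start of the partition. */
  static int toBitIndex(int docId, int minDocId) {
    return docId - minDocId;
  }

  public static void main(String[] args) {
    int maxDoc = 10_000_000; // example segment size, chosen for illustration
    int minDocId = 2_000_000; // example partition bounds, chosen for illustration
    int maxDocId = 2_999_999;

    int range = rangeSize(minDocId, maxDocId);
    System.out.println("segment:   threshold=" + threshold(maxDoc)
        + " bitSetBytes=" + bitSetBytes(maxDoc));
    System.out.println("partition: threshold=" + threshold(range)
        + " bitSetBytes=" + bitSetBytes(range));
  }
}
```

With these example numbers, the partition-sized bit set needs 125,000 bytes instead of 1,250,000, and the threshold drops from 78,125 to 7,812 in the same proportion.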





[PR] upgrade from rat 0.14 to rat 0.15 [lucene]

2025-05-11 Thread via GitHub


rmuir opened a new pull request, #14648:
URL: https://github.com/apache/lucene/pull/14648

   This upgrade doesn't break our build; it seems the API changes that cause 
issues might begin with 0.16:
   
   https://creadur.apache.org/rat/changes-report.html
   
   





Re: [PR] deps(java): bump org.apache.rat:apache-rat from 0.14 to 0.16.1 [lucene]

2025-05-11 Thread via GitHub


rmuir commented on PR #14582:
URL: https://github.com/apache/lucene/pull/14582#issuecomment-2870547128

   first bumping to 0.15 via #14648
   
   we can rebase the bot after that here.
   
   0.16.x seems like a bigger change based on 
https://creadur.apache.org/rat/changes-report.html, so there will be more work.





[PR] Enable changelog verifier [lucene]

2025-05-11 Thread via GitHub


stefanvodita opened a new pull request, #14644:
URL: https://github.com/apache/lucene/pull/14644

   The changelog verifier will start to post comments on PRs and to add 
milestones.





Re: [I] Create a bot to check if there is a CHANGES entry for new PRs [lucene]

2025-05-11 Thread via GitHub


stefanvodita commented on issue #13898:
URL: https://github.com/apache/lucene/issues/13898#issuecomment-2870276813

   I've been monitoring the jobs after the most recent batch of fixes and I'm 
happy with the results. The only change that the bot got wrong was #14638 
([logs](https://github.com/apache/lucene/actions/runs/14941689255/job/41979703606)),
 which moves around a changelog entry from 11.0.0 to 10.3.0. The bot would 
still assign it milestone 11.0.0, but I think that's not a serious mistake and 
I'm willing to take a progress-not-perfection approach here. I opened #14644 to 
start assigning milestones automatically 
([test](https://github.com/stefanvodita/lucene/actions/runs/14960617144/job/42022012969?pr=9))
 and to post a comment on PRs that don't have a changelog entry 
([test](https://github.com/stefanvodita/lucene/actions/runs/14960628233/job/42022036265?pr=9)).
 If there aren't objections, I will send an announcement to the dev list ahead 
of pushing this and enabling the bot.





[PR] Add instructions to help/IDEs.txt for VSCode and Neovim [lucene]

2025-05-11 Thread via GitHub


rmuir opened a new pull request, #14646:
URL: https://github.com/apache/lucene/pull/14646

   Both of these use the eclipse language server, so they just leverage 
existing `gradlew eclipse`.
   
   The trick is to disable Eclipse Language Server's built-in gradle 
integration and just use the .classpath/.settings, otherwise chaos.
   
   The same general approach should work for other editors using the language 
server (Emacs, Helix, Zed, etc) too.
   
   





Re: [PR] Use the preload hint on completion fields and memory terms dictionaries. [lucene]

2025-05-11 Thread via GitHub


jpountz merged PR #14634:
URL: https://github.com/apache/lucene/pull/14634





Re: [PR] Clean up FileTypeHint a bit. [lucene]

2025-05-11 Thread via GitHub


jpountz merged PR #14635:
URL: https://github.com/apache/lucene/pull/14635





Re: [I] TopFieldCollector mistakenly assumes that all leaves share the same index sort [lucene]

2025-05-11 Thread via GitHub


msokolov commented on issue #14399:
URL: https://github.com/apache/lucene/issues/14399#issuecomment-2869828209

   Would it make sense to have different collectors for the two cases, one with 
and one without a cache?





Re: [I] Segment count (merging) can impact recall on KNN ParentJoin queries [lucene]

2025-05-11 Thread via GitHub


msokolov commented on issue #14643:
URL: https://github.com/apache/lucene/issues/14643#issuecomment-2869830216

   sadly, this is expected. It's not only parent-join, but any kind of 
approximate NN search. Think of the limit where we have as many segments as 
there are documents: recall will always be 100% because we will perform a 
"brute force" index scan.
   
   Figuring out how to maintain the same recall as the index merges would be 
an interesting problem. The pro-rata collection method we've switched to now 
will tend to reduce the work done per segment as the segments shrink, but it 
has enough of a buffer that I think we'd still see this effect.





Re: [I] Nightly benchmark regression on 2025.05.01 [lucene]

2025-05-11 Thread via GitHub


msokolov closed issue #14630: Nightly benchmark regression on 2025.05.01
URL: https://github.com/apache/lucene/issues/14630





Re: [PR] build(deps): bump ruff from 0.11.7 to 0.11.8 in /dev-tools/scripts [lucene]

2025-05-11 Thread via GitHub


rmuir merged PR #14603:
URL: https://github.com/apache/lucene/pull/14603





Re: [PR] deps(java): bump de.jflex:jflex from 1.8.2 to 1.9.1 [lucene]

2025-05-11 Thread via GitHub


rmuir commented on code in PR #14583:
URL: https://github.com/apache/lucene/pull/14583#discussion_r2083651585


##
lucene/analysis/common/src/java/org/apache/lucene/analysis/classic/ClassicTokenizerImpl.java:
##
@@ -438,6 +436,16 @@ public final void setBufferSize(int numChars) {
     this.zzReader = in;
   }
 
+  /** Returns the maximum size of the scanner buffer, which limits the size of tokens. */
+  private int zzMaxBufferLen() {
+    return Integer.MAX_VALUE;
+  }
+
+  /** Whether the scanner buffer can grow to accommodate a larger token. */
+  private boolean zzCanGrow() {
+    return true;
+  }

Review Comment:
   I understand it now. these buffer limits/control are new features that we 
probably want to adopt (and remove our "skeletons"). first I want to get tests 
passing before attempting it.






[I] investigate jflex 1.9.x buffer size/expansion feature [lucene]

2025-05-11 Thread via GitHub


rmuir opened a new issue, #14645:
URL: https://github.com/apache/lucene/issues/14645

   ### Description
   
   #14583 only bumps the dependency and regenerates, but doesn't take advantage 
of the new features. I think we are currently taking care of this with skeleton 
files in `gradle/generation/jflex`. Using the builtin functionality might be 
cleaner.





Re: [PR] deps(java): bump de.jflex:jflex from 1.8.2 to 1.9.1 [lucene]

2025-05-11 Thread via GitHub


rmuir merged PR #14583:
URL: https://github.com/apache/lucene/pull/14583





Re: [PR] build(deps): bump ruff from 0.11.7 to 0.11.8 in /dev-tools/scripts [lucene]

2025-05-11 Thread via GitHub


rmuir commented on PR #14603:
URL: https://github.com/apache/lucene/pull/14603#issuecomment-2870383819

   @dependabot rebase





Re: [I] Segment count (merging) can impact recall on KNN ParentJoin queries [lucene]

2025-05-11 Thread via GitHub


jpountz commented on issue #14643:
URL: https://github.com/apache/lucene/issues/14643#issuecomment-2870180042

   Why are the recall values so bad with parent-join queries (whether merging 
is enabled or not)? Is there a bug?





Re: [I] Promote sandbox facets to the main facets module [lucene]

2025-05-11 Thread via GitHub


jpountz commented on issue #14619:
URL: https://github.com/apache/lucene/issues/14619#issuecomment-2870187602

   Facets already put the burden of choosing between taxonomy and 
doc-value-based faceting on users. If we introduce a new approach for faceting, 
I worry that it would make things even worse: if a user wants to compute facets 
in their application, what should they use?
   
   I personally like the new faceting approach better. In particular, it 
doesn't use O(maxDoc) heap to store a bit set, and it allows collectors to give 
feedback to the query about which docs they care about 
(`LeafCollector#competitiveIterator()`). But I'm also not as familiar with the 
faceting module as @gsmiller or @mikemccand, so I'm curious: is there consensus 
that this new approach for faceting should eventually replace the existing one, 
or do we expect both to keep evolving and serve different purposes?

