Re: [I] Nightly benchmark regression on 2025.05.01 [lucene]
rmuir commented on issue #14630: URL: https://github.com/apache/lucene/issues/14630#issuecomment-2870192359 > Oh, hmmm, maybe not -- JDK 23 EOL'd. you can still download it the old fashioned way for a test: https://www.oracle.com/java/technologies/javase/jdk23-archive-downloads.html -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Add comment about using InverseIntersectVisit and IntersectVisitor. [lucene]
vsop-479 opened a new pull request, #14647: URL: https://github.com/apache/lucene/pull/14647 ### Description
Re: [I] Dynamic threshold for DocIdSetBuilder [lucene]
prudhvigodithi commented on issue #14485: URL: https://github.com/apache/lucene/issues/14485#issuecomment-2870497304

So from my understanding, instead of creating one large BitSet for the entire segment (sized for maxDoc), the suggestion is to:
- Create a smaller BitSet that only covers the specific range of document IDs in a partition.
- Use the minDocId and maxDocId of the partition to define this range, something like the following:

```
int rangeSize = maxDocId - minDocId + 1;
this.threshold = rangeSize >>> 7;
```

So now the memory cost of the BitSet would be based on the range size, not maxDoc.
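The sizing idea above can be sketched in isolation. This is a minimal illustration, not Lucene code: the class `PartitionSizing` and its methods are hypothetical names, the example doc ID values are made up, and the footprint estimate assumes a plain bit set that costs one bit per document, rounded up to whole 64-bit words.

```java
/** Hypothetical helper, not part of Lucene: compares segment-wide and per-partition sizing. */
final class PartitionSizing {

  /** Threshold derived from a document count, using the same shift as the snippet above. */
  static int threshold(int numDocs) {
    return numDocs >>> 7;
  }

  /** Number of doc IDs in the inclusive range [minDocId, maxDocId]. */
  static int rangeSize(int minDocId, int maxDocId) {
    if (minDocId < 0 || maxDocId < minDocId) {
      throw new IllegalArgumentException("invalid range: " + minDocId + ".." + maxDocId);
    }
    // addExact guards the +1 against int overflow for extreme ranges.
    return Math.addExact(maxDocId - minDocId, 1);
  }

  /** Bytes for a plain bit set with one bit per doc, rounded up to whole 64-bit words. */
  static long bitSetBytes(int numBits) {
    return ((numBits + 63L) >>> 6) * Long.BYTES;
  }

  public static void main(String[] args) {
    // Assumed example values: a 10M-doc segment and a partition covering 1M doc IDs.
    int maxDoc = 10_000_000;
    int range = rangeSize(2_000_000, 2_999_999);
    System.out.println(
        "segment:   threshold=" + threshold(maxDoc) + " bytes=" + bitSetBytes(maxDoc));
    System.out.println(
        "partition: threshold=" + threshold(range) + " bytes=" + bitSetBytes(range));
  }
}
```

Under these assumptions both the threshold and the bit-set footprint scale with the partition's range rather than with maxDoc. A bit set sized this way would also need doc IDs offset by minDocId when setting bits, which the sketch leaves out.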
[PR] upgrade from rat 0.14 to rat 0.15 [lucene]
rmuir opened a new pull request, #14648: URL: https://github.com/apache/lucene/pull/14648 This upgrade doesn't break our build; the API changes that cause issues seem to begin with 0.16: https://creadur.apache.org/rat/changes-report.html
Re: [PR] deps(java): bump org.apache.rat:apache-rat from 0.14 to 0.16.1 [lucene]
rmuir commented on PR #14582: URL: https://github.com/apache/lucene/pull/14582#issuecomment-2870547128 First bumping to 0.15 via #14648; we can rebase the bot here after that. 0.16.x seems like a bigger change based on https://creadur.apache.org/rat/changes-report.html, so there will be more work.
[PR] Enable changelog verifier [lucene]
stefanvodita opened a new pull request, #14644: URL: https://github.com/apache/lucene/pull/14644 The changelog verifier will start to post comments on PRs and to add milestones.
Re: [I] Create a bot to check if there is a CHANGES entry for new PRs [lucene]
stefanvodita commented on issue #13898: URL: https://github.com/apache/lucene/issues/13898#issuecomment-2870276813 I've been monitoring the jobs after the most recent batch of fixes and I'm happy with the results. The only change that the bot got wrong was #14638 ([logs](https://github.com/apache/lucene/actions/runs/14941689255/job/41979703606)), which moves around a changelog entry from 11.0.0 to 10.3.0. The bot would still assign it milestone 11.0.0, but I think that's not a serious mistake and I'm willing to take a progress-not-perfection approach here. I opened #14644 to start assigning milestones automatically ([test](https://github.com/stefanvodita/lucene/actions/runs/14960617144/job/42022012969?pr=9)) and to post a comment on PRs that don't have a changelog entry ([test](https://github.com/stefanvodita/lucene/actions/runs/14960628233/job/42022036265?pr=9)). If there aren't objections, I will send an announcement to the dev list ahead of pushing this and enabling the bot.
[PR] Add instructions to help/IDEs.txt for VSCode and Neovim [lucene]
rmuir opened a new pull request, #14646: URL: https://github.com/apache/lucene/pull/14646 Both of these use the eclipse language server, so they just leverage existing `gradlew eclipse`. The trick is to disable Eclipse Language Server's built-in gradle integration and just use the .classpath/.settings, otherwise chaos. The same general approach should work for other editors using the language server (Emacs, Helix, Zed, etc) too.
Re: [PR] Use the preload hint on completion fields and memory terms dictionaries. [lucene]
jpountz merged PR #14634: URL: https://github.com/apache/lucene/pull/14634
Re: [PR] Clean up FileTypeHint a bit. [lucene]
jpountz merged PR #14635: URL: https://github.com/apache/lucene/pull/14635
Re: [I] TopFieldCollector mistakenly assumes that all leaves share the same index sort [lucene]
msokolov commented on issue #14399: URL: https://github.com/apache/lucene/issues/14399#issuecomment-2869828209 Would it make sense to have different collectors for the two cases, one with and one without a cache?
Re: [I] Segment count (merging) can impact recall on KNN ParentJoin queries [lucene]
msokolov commented on issue #14643: URL: https://github.com/apache/lucene/issues/14643#issuecomment-2869830216 Sadly, this is expected. It's not only parent-join, but any kind of approximate NN search. Think of the limit where we have as many segments as there are documents: recall will always be 100% because we will perform a "brute force" index scan. If we want to figure out how to maintain the same recall as the index merges, that would be an interesting problem. The pro-rata collection method we've switched to now will tend to reduce the work done per segment as the segments shrink, but it has enough of a buffer that I think we'd still see this effect.
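The limiting argument above can be illustrated with a toy model. This is not how Lucene's vector search works: the class `SegmentRecallToy` is hypothetical, and a fixed per-segment visit budget (the first few docs of each segment) is only an assumed stand-in for a bounded approximate search. It shows just the direction of the effect: with more, smaller segments, a larger share of the index gets visited, and recall approaches that of a brute-force scan.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model, not Lucene code: a per-segment visit budget stands in for approximate search. */
final class SegmentRecallToy {

  /**
   * Splits docs 0..n-1 into contiguous segments of segmentSize, visits at most budget docs
   * per segment, and returns the k visited docs with the smallest distance to the query.
   */
  static List<Integer> search(double[] distances, int segmentSize, int budget, int k) {
    if (segmentSize <= 0) {
      throw new IllegalArgumentException("segmentSize must be positive");
    }
    List<Integer> visited = new ArrayList<>();
    for (int start = 0; start < distances.length; start += segmentSize) {
      int end = Math.min(start + segmentSize, distances.length);
      for (int doc = start; doc < end && doc - start < budget; doc++) {
        visited.add(doc);
      }
    }
    // Merge step: keep the k closest docs among everything the segments visited.
    visited.sort(Comparator.comparingDouble((Integer doc) -> distances[doc]));
    return new ArrayList<>(visited.subList(0, Math.min(k, visited.size())));
  }

  /** Fraction of the exact top-k that the budgeted search returned. */
  static double recall(double[] distances, int segmentSize, int budget, int k) {
    // Exact answer: a single segment with an unlimited budget visits every doc.
    Set<Integer> exact =
        new HashSet<>(search(distances, distances.length, distances.length, k));
    long hits =
        search(distances, segmentSize, budget, k).stream().filter(exact::contains).count();
    return (double) hits / exact.size();
  }

  public static void main(String[] args) {
    // Made-up distances from a query to ten docs; the index in the array is the doc ID.
    double[] distances = {5, 0, 9, 8, 1, 7, 6, 2, 4, 3};
    for (int segmentSize : new int[] {10, 5, 3, 1}) {
      System.out.println(
          "segmentSize=" + segmentSize + " recall=" + recall(distances, segmentSize, 2, 3));
    }
  }
}
```

With one doc per segment the budget never binds, so every doc is visited and the merged result equals the exact top-k, which is the 100% recall limit described above.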
Re: [I] Nightly benchmark regression on 2025.05.01 [lucene]
msokolov closed issue #14630: Nightly benchmark regression on 2025.05.01 URL: https://github.com/apache/lucene/issues/14630
Re: [PR] build(deps): bump ruff from 0.11.7 to 0.11.8 in /dev-tools/scripts [lucene]
rmuir merged PR #14603: URL: https://github.com/apache/lucene/pull/14603
Re: [PR] deps(java): bump de.jflex:jflex from 1.8.2 to 1.9.1 [lucene]
rmuir commented on code in PR #14583: URL: https://github.com/apache/lucene/pull/14583#discussion_r2083651585

## lucene/analysis/common/src/java/org/apache/lucene/analysis/classic/ClassicTokenizerImpl.java: ##

@@ -438,6 +436,16 @@ public final void setBufferSize(int numChars) {
     this.zzReader = in;
   }
 
+  /** Returns the maximum size of the scanner buffer, which limits the size of tokens. */
+  private int zzMaxBufferLen() {
+    return Integer.MAX_VALUE;
+  }
+
+  /** Whether the scanner buffer can grow to accommodate a larger token. */
+  private boolean zzCanGrow() {
+    return true;
+  }

Review Comment: I understand it now. These buffer limits/control are new features that we probably want to adopt (and remove our "skeletons"). First I want to get tests passing before attempting it.
[I] investigate jflex 1.9.x buffer size/expansion feature [lucene]
rmuir opened a new issue, #14645: URL: https://github.com/apache/lucene/issues/14645 ### Description #14583 only bumps the dependency and regenerates, but doesn't take advantage of the new features. I think we are currently taking care of this with skeleton files in `gradle/generation/jflex`. Using the builtin functionality might be cleaner.
Re: [PR] deps(java): bump de.jflex:jflex from 1.8.2 to 1.9.1 [lucene]
rmuir merged PR #14583: URL: https://github.com/apache/lucene/pull/14583
Re: [PR] build(deps): bump ruff from 0.11.7 to 0.11.8 in /dev-tools/scripts [lucene]
rmuir commented on PR #14603: URL: https://github.com/apache/lucene/pull/14603#issuecomment-2870383819 @dependabot rebase
Re: [I] Segment count (merging) can impact recall on KNN ParentJoin queries [lucene]
jpountz commented on issue #14643: URL: https://github.com/apache/lucene/issues/14643#issuecomment-2870180042 Why are the recall values so bad with parent-join queries (whether merging is enabled or not)? Is there a bug?
Re: [I] Promote sandbox facets to the main facets module [lucene]
jpountz commented on issue #14619: URL: https://github.com/apache/lucene/issues/14619#issuecomment-2870187602 Facets already put the burden of choosing between taxonomy and doc-value-based faceting on users. If we introduce a new approach for faceting, I worry that it would make things even worse: if a user wants to compute facets in their application, what should they use? I personally like the new faceting approach better: in particular, it doesn't use O(maxDoc) heap to store a bit set, and it allows collectors to give feedback to the query about which docs they care about (`LeafCollector#competitiveIterator()`). But I'm also not as familiar with the faceting module as @gsmiller or @mikemccand, so I'm curious: is there consensus that this new approach for faceting should eventually replace the existing one, or do we anticipate both to keep developing and serve different purposes?