Re: [PR] Speed up TermQuery [lucene]
github-actions[bot] commented on PR #14709: URL: https://github.com/apache/lucene/pull/14709#issuecomment-2908055656 This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Speed up TermQuery [lucene]
gf2121 commented on code in PR #14709: URL: https://github.com/apache/lucene/pull/14709#discussion_r2106295517

## lucene/core/src/java/org/apache/lucene/search/BatchScoreBulkScorer.java: ##
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import org.apache.lucene.util.Bits;
+
+/**
+ * A bulk scorer used when {@link ScoreMode#needsScores()} is true and {@link
+ * Scorer#nextDocsAndScores} has optimizations to run faster than one-by-one iteration.
+ */
+class BatchScoreBulkScorer extends BulkScorer {
+
+  private final SimpleScorable scorable = new SimpleScorable();
+  private final DocAndScoreBuffer buffer = new DocAndScoreBuffer();
+  private final Scorer scorer;
+
+  BatchScoreBulkScorer(Scorer scorer) {
+    this.scorer = scorer;
+  }
+
+  @Override
+  public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
+    if (collector.competitiveIterator() != null) {
+      return new Weight.DefaultBulkScorer(scorer).score(collector, acceptDocs, min, max);
+    }

Review Comment:
Thanks for the feedback! I moved the implementation into `DefaultBulkScorer`.

> if (scoreMode == TOP_SCORES && competitiveIterator == null)

As the description shows, exhaustive execution gets optimized as well, so I use `scoreMode.needsScores` instead.
Re: [PR] Speed up TermQuery [lucene]
jpountz commented on code in PR #14709: URL: https://github.com/apache/lucene/pull/14709#discussion_r2106282443

## lucene/core/src/java/org/apache/lucene/search/BatchScoreBulkScorer.java: ##
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import org.apache.lucene.util.Bits;
+
+/**
+ * A bulk scorer used when {@link ScoreMode#needsScores()} is true and {@link
+ * Scorer#nextDocsAndScores} has optimizations to run faster than one-by-one iteration.
+ */
+class BatchScoreBulkScorer extends BulkScorer {
+
+  private final SimpleScorable scorable = new SimpleScorable();
+  private final DocAndScoreBuffer buffer = new DocAndScoreBuffer();
+  private final Scorer scorer;
+
+  BatchScoreBulkScorer(Scorer scorer) {
+    this.scorer = scorer;
+  }
+
+  @Override
+  public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
+    if (collector.competitiveIterator() != null) {
+      return new Weight.DefaultBulkScorer(scorer).score(collector, acceptDocs, min, max);
+    }

Review Comment:
I wonder if this should be an implementation detail of `DefaultBulkScorer` instead of a different class. Doing something like

```
if (scoreMode == TOP_SCORES && competitiveIterator == null) {
  // new optimization
} else {
  // existing DefaultBulkScorer code
}
```
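The dispatch discussed in this thread can be sketched in isolation. The block below is a toy illustration, not Lucene code: `ToyScorer`, `HitCollector`, `ListCollector`, `nextBatch` and the buffer size of 4 are all hypothetical stand-ins, and `acceptDocs` and the `min`/`max` window are ignored. It assumes the variant gf2121 describes, where the batch path is taken whenever scores are needed and the collector exposes no competitive iterator.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-ins only; none of these types are Lucene classes. */
final class BatchDispatchSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Hypothetical collector: receives hits and says whether it can skip non-competitive docs. */
  interface HitCollector {
    void collect(int doc, float score);

    boolean hasCompetitiveIterator();
  }

  /** Hypothetical collector that records the doc IDs it receives. */
  static final class ListCollector implements HitCollector {
    final List<Integer> docs = new ArrayList<>();
    private final boolean competitive;

    ListCollector(boolean competitive) {
      this.competitive = competitive;
    }

    @Override
    public void collect(int doc, float score) {
      docs.add(doc);
    }

    @Override
    public boolean hasCompetitiveIterator() {
      return competitive;
    }
  }

  /** Hypothetical scorer over fixed, parallel arrays of doc IDs and scores. */
  static final class ToyScorer {
    private final int[] docs;
    private final float[] scores;
    private int pos = -1;

    ToyScorer(int[] docs, float[] scores) {
      this.docs = docs;
      this.scores = scores;
    }

    /** One-by-one iteration. */
    int nextDoc() {
      pos++;
      return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    float score() {
      return scores[pos];
    }

    /** Batch iteration: fills the buffers and returns how many hits were written. */
    int nextBatch(int[] docBuf, float[] scoreBuf) {
      int n = 0;
      while (n < docBuf.length && pos + 1 < docs.length) {
        pos++;
        docBuf[n] = docs[pos];
        scoreBuf[n] = scores[pos];
        n++;
      }
      return n;
    }
  }

  /** Picks the batch path or the one-by-one path; returns the name of the path taken. */
  static String score(ToyScorer scorer, HitCollector collector, boolean needsScores) {
    if (needsScores && !collector.hasCompetitiveIterator()) {
      int[] docBuf = new int[4];
      float[] scoreBuf = new float[4];
      for (int n = scorer.nextBatch(docBuf, scoreBuf); n > 0; n = scorer.nextBatch(docBuf, scoreBuf)) {
        for (int i = 0; i < n; i++) {
          collector.collect(docBuf[i], scoreBuf[i]);
        }
      }
      return "batch";
    }
    for (int doc = scorer.nextDoc(); doc != NO_MORE_DOCS; doc = scorer.nextDoc()) {
      collector.collect(doc, needsScores ? scorer.score() : 0f);
    }
    return "one-by-one";
  }

  public static void main(String[] args) {
    int[] docs = {1, 3, 5, 7, 9};
    float[] scores = {1f, 2f, 3f, 4f, 5f};
    System.out.println(score(new ToyScorer(docs, scores), new ListCollector(false), true));
    System.out.println(score(new ToyScorer(docs, scores), new ListCollector(true), true));
  }
}
```

Both paths deliver the same hits in the same order; only the iteration strategy differs, which is why keeping the choice inside a single scorer class is a plausible design.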
Re: [PR] Use a hint to specify READONCE IOContext [lucene]
jpountz commented on code in PR #14509: URL: https://github.com/apache/lucene/pull/14509#discussion_r2106283643

## lucene/core/src/java/org/apache/lucene/store/IOContext.java: ##
@@ -56,7 +56,7 @@ interface FileOpenHint {}
    * This context should only be used when the read operations will be performed in the same
    * thread as the thread that opens the underlying storage.
    */
-  IOContext READONCE = new DefaultIOContext(DataAccessHint.SEQUENTIAL);
+  IOContext READONCE = new DefaultIOContext(DataAccessHint.SEQUENTIAL, ReadOnceHint.INSTANCE);

Review Comment:
OK, I had missed that. SEQUENTIAL intuitively doesn't sound like the best option, but let's keep it for now then, I'll open a separate discussion.
Re: [PR] Refactor main top-n bulk scorers to evaluate hits in a more term-at-a-time fashion. [lucene]
jpountz merged PR #14701: URL: https://github.com/apache/lucene/pull/14701
Re: [PR] Fix Method declared 'final' in 'final' class in LongHeap. [lucene]
github-actions[bot] commented on PR #14712: URL: https://github.com/apache/lucene/pull/14712#issuecomment-2908407435 This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR.
[PR] Fix Method declared 'final' in 'final' class in LongHeap. [lucene]
vsop-479 opened a new pull request, #14712: URL: https://github.com/apache/lucene/pull/14712 ### Description
Re: [PR] Speed up TermQuery [lucene]
github-actions[bot] commented on PR #14709: URL: https://github.com/apache/lucene/pull/14709#issuecomment-2908065719 This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR.
[PR] Fix comment above OnHeapHnswGraph#getNeighbors. [lucene]
vsop-479 opened a new pull request, #14713: URL: https://github.com/apache/lucene/pull/14713 ### Description
Re: [PR] Fix comment above OnHeapHnswGraph#getNeighbors. [lucene]
github-actions[bot] commented on PR #14713: URL: https://github.com/apache/lucene/pull/14713#issuecomment-2908433928 This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR.
Re: [I] Remove Telugu normalization of vu వు to ma మ from IndicNormalizer [lucene]
praveen-d291 commented on issue #14659: URL: https://github.com/apache/lucene/issues/14659#issuecomment-2907706053

@rmuir, You're absolutely right; I should have led with this data in my initial comment. My apologies for not providing the "homework" upfront.

Here's a direct look at the state of modern Telugu content, which strongly suggests that the issues the IndicNormalizationFilter was designed to address are less prevalent now:

1. **Prevalence of Clean Unicode Text**: I've analyzed several high-volume, real-world Telugu sources, and the trend towards clean Unicode is very clear across these examples:
   - The official website of the Government of Telangana: https://www.telangana.gov.in/te/
   - The Andhra Pradesh Government's Irrigation Department website: https://irrigationap.cgg.gov.in/wrd/home
   - The Andhra Pradesh Agriculture Department website: https://www.apagrisnet.gov.in/
   - A major Telugu news publication like Eenadu: https://www.eenadu.net/ (consistently a top 3 paper by circulation).

   All content on these sites consistently uses UTF-8 Unicode. Characters like వు (vu) and మ (ma) are rendered distinctly and unambiguously.

2. **Widespread OS-Level Font Support**: The need for "custom fonts from websites" or "janky conversion" is largely gone because popular OS vendors have been bundling robust Telugu font support for over two decades:
   - **Windows**: Gautami has been included since 2001 (https://en.wikipedia.org/wiki/Gautami_(typeface)). Nirmala UI, a comprehensive typeface for Indic scripts, has been bundled since Windows 8 (https://en.wikipedia.org/wiki/Nirmala_UI).
   - **macOS**: macOS Monterey alone includes 15 Telugu fonts (Apple support page: https://support.apple.com/en-in/103203).

   With this widespread, native OS support, users generally no longer deal with systems that require special handling or struggle to render complex scripts in modern Unicode Telugu text.

The core issue is that applying the వు to మ conflation by default now introduces a linguistically incorrect loss of precision for the vast majority of current Telugu content. Given this, I want to reiterate the two options I proposed earlier for addressing this:

Option 1: Fix the Default (My Preference)
I'd propose adding a boolean option to the TeluguAnalyzer constructor to control IndicNormalizationFilter inclusion, and making its default false. This would make TeluguAnalyzer precise right out of the box for modern documents. Users with older, less-formatted text could still explicitly enable it. I believe this is a necessary correction for linguistic accuracy, and it explicitly documents this conversion.

Option 2: Document the behavior in TeluguAnalyzer
Alternatively, we could document this specific behavior in the TeluguAnalyzer docs, explaining the వు to మ mapping and how to build a custom analyzer to avoid it.

Option 1 feels like the right long-term fix for the default user experience, given the current state of Telugu content. What do you think? I can raise a PR once we agree on this.
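The precision loss described above can be illustrated with a toy sketch. This is not Lucene's `IndicNormalizer` or `TeluguAnalyzer`: the class `TeluguNormalizationSketch`, the `normalize` method and its `conflateVuToMa` flag are hypothetical names, and the plain string replacement only assumes the వు to మ mapping named in this issue. The flag plays the role of the boolean proposed in Option 1.

```java
/** Toy illustration only; this is not Lucene's IndicNormalizer or TeluguAnalyzer. */
final class TeluguNormalizationSketch {
  // Written as escapes so the code points are unambiguous.
  static final String VU = "\u0C35\u0C41"; // వు = TELUGU LETTER VA + TELUGU VOWEL SIGN U
  static final String MA = "\u0C2E"; // మ = TELUGU LETTER MA

  /**
   * Hypothetical switch mirroring Option 1: when the flag is false the text passes through
   * unchanged; when it is true every "vu" sequence is rewritten to "ma".
   */
  static String normalize(String text, boolean conflateVuToMa) {
    return conflateVuToMa ? text.replace(VU, MA) : text;
  }

  public static void main(String[] args) {
    // With conflation on, two distinct inputs collapse to the same term.
    System.out.println(normalize(VU, true).equals(normalize(MA, true))); // prints "true"
    // With conflation off, they stay distinct.
    System.out.println(normalize(VU, false).equals(normalize(MA, false))); // prints "false"
  }
}
```

Once two distinct strings normalize to the same term, an index cannot tell them apart at query time, which is the loss of precision the comment refers to.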