Re: [PR] Speed up TermQuery [lucene]

2025-05-25 Thread via GitHub


github-actions[bot] commented on PR #14709:
URL: https://github.com/apache/lucene/pull/14709#issuecomment-2908055656

   This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. 
If the PR doesn't need a changelog entry, then add the skip-changelog-check 
label to it and you will stop receiving this reminder on future updates to the 
PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up TermQuery [lucene]

2025-05-25 Thread via GitHub


gf2121 commented on code in PR #14709:
URL: https://github.com/apache/lucene/pull/14709#discussion_r2106295517


##
lucene/core/src/java/org/apache/lucene/search/BatchScoreBulkScorer.java:
##
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import org.apache.lucene.util.Bits;
+
+/**
+ * A bulk scorer used when {@link ScoreMode#needsScores()} is true and {@link
+ * Scorer#nextDocsAndScores} has optimizations to run faster than one-by-one iteration.
+ */
+class BatchScoreBulkScorer extends BulkScorer {
+
+  private final SimpleScorable scorable = new SimpleScorable();
+  private final DocAndScoreBuffer buffer = new DocAndScoreBuffer();
+  private final Scorer scorer;
+
+  BatchScoreBulkScorer(Scorer scorer) {
+    this.scorer = scorer;
+  }
+
+  @Override
+  public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
+    if (collector.competitiveIterator() != null) {
+      return new Weight.DefaultBulkScorer(scorer).score(collector, acceptDocs, min, max);
+    }

Review Comment:
   Thanks for the feedback! I moved the implementation into `DefaultBulkScorer`.
   
   > if (scoreMode == TOP_SCORES && competitiveIterator == null)
   
   As the PR description shows, exhaustive execution gets optimized as well, so I use 
`scoreMode.needsScores` instead.
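
The difference between the two conditions can be sketched in isolation. This is a minimal, self-contained illustration and not the code from the PR: `SketchScoreMode` is a hypothetical stand-in for Lucene's `ScoreMode` with only three of its values, and the competitive iterator is reduced to a boolean flag.

```java
// Hypothetical stand-in for Lucene's ScoreMode; only what the sketch needs.
enum SketchScoreMode {
  COMPLETE(true),
  COMPLETE_NO_SCORES(false),
  TOP_SCORES(true);

  private final boolean needsScores;

  SketchScoreMode(boolean needsScores) {
    this.needsScores = needsScores;
  }

  boolean needsScores() {
    return needsScores;
  }
}

final class BulkScorerPathSketch {

  // Condition quoted from the review: only top-scores collection takes the batch path.
  static boolean batchPathTopScoresOnly(SketchScoreMode mode, boolean hasCompetitiveIterator) {
    return mode == SketchScoreMode.TOP_SCORES && !hasCompetitiveIterator;
  }

  // Condition described in the reply: every score-needing mode takes the batch path.
  static boolean batchPathNeedsScores(SketchScoreMode mode, boolean hasCompetitiveIterator) {
    return mode.needsScores() && !hasCompetitiveIterator;
  }

  public static void main(String[] args) {
    // Print which path each mode would take when no competitive iterator is present.
    for (SketchScoreMode mode : SketchScoreMode.values()) {
      System.out.println(
          mode
              + ": topScoresOnly=" + batchPathTopScoresOnly(mode, false)
              + ", needsScores=" + batchPathNeedsScores(mode, false));
    }
  }
}
```

Under these assumptions the two conditions differ only for exhaustive scoring (`COMPLETE`), which is the case the `needsScores` check is meant to include.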






Re: [PR] Speed up TermQuery [lucene]

2025-05-25 Thread via GitHub


jpountz commented on code in PR #14709:
URL: https://github.com/apache/lucene/pull/14709#discussion_r2106282443


##
lucene/core/src/java/org/apache/lucene/search/BatchScoreBulkScorer.java:
##
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import org.apache.lucene.util.Bits;
+
+/**
+ * A bulk scorer used when {@link ScoreMode#needsScores()} is true and {@link
+ * Scorer#nextDocsAndScores} has optimizations to run faster than one-by-one iteration.
+ */
+class BatchScoreBulkScorer extends BulkScorer {
+
+  private final SimpleScorable scorable = new SimpleScorable();
+  private final DocAndScoreBuffer buffer = new DocAndScoreBuffer();
+  private final Scorer scorer;
+
+  BatchScoreBulkScorer(Scorer scorer) {
+    this.scorer = scorer;
+  }
+
+  @Override
+  public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
+    if (collector.competitiveIterator() != null) {
+      return new Weight.DefaultBulkScorer(scorer).score(collector, acceptDocs, min, max);
+    }

Review Comment:
   I wonder if this should be an implementation detail of `DefaultBulkScorer` 
instead of a different class. Doing something like
   
   ```
   if (scoreMode == TOP_SCORES && competitiveIterator == null) {
     // new optimization
   } else {
     // existing DefaultBulkScorer code
   }
   ```






Re: [PR] Use a hint to specify READONCE IOContext [lucene]

2025-05-25 Thread via GitHub


jpountz commented on code in PR #14509:
URL: https://github.com/apache/lucene/pull/14509#discussion_r2106283643


##
lucene/core/src/java/org/apache/lucene/store/IOContext.java:
##
@@ -56,7 +56,7 @@ interface FileOpenHint {}
    * This context should only be used when the read operations will be performed in the same
    * thread as the thread that opens the underlying storage.
    */
-  IOContext READONCE = new DefaultIOContext(DataAccessHint.SEQUENTIAL);
+  IOContext READONCE = new DefaultIOContext(DataAccessHint.SEQUENTIAL, ReadOnceHint.INSTANCE);

Review Comment:
   OK, I had missed that. SEQUENTIAL intuitively doesn't sound like the best 
option, but let's keep it for now; I'll open a separate discussion.






Re: [PR] Refactor main top-n bulk scorers to evaluate hits in a more term-at-a-time fashion. [lucene]

2025-05-25 Thread via GitHub


jpountz merged PR #14701:
URL: https://github.com/apache/lucene/pull/14701





Re: [PR] Fix Method declared 'final' in 'final' class in LongHeap. [lucene]

2025-05-25 Thread via GitHub


github-actions[bot] commented on PR #14712:
URL: https://github.com/apache/lucene/pull/14712#issuecomment-2908407435

   This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. 
If the PR doesn't need a changelog entry, then add the skip-changelog-check 
label to it and you will stop receiving this reminder on future updates to the 
PR.





[PR] Fix Method declared 'final' in 'final' class in LongHeap. [lucene]

2025-05-25 Thread via GitHub


vsop-479 opened a new pull request, #14712:
URL: https://github.com/apache/lucene/pull/14712

   ### Description
   
   
   





Re: [PR] Speed up TermQuery [lucene]

2025-05-25 Thread via GitHub


github-actions[bot] commented on PR #14709:
URL: https://github.com/apache/lucene/pull/14709#issuecomment-2908065719

   This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. 
If the PR doesn't need a changelog entry, then add the skip-changelog-check 
label to it and you will stop receiving this reminder on future updates to the 
PR.





[PR] Fix comment above OnHeapHnswGraph#getNeighbors. [lucene]

2025-05-25 Thread via GitHub


vsop-479 opened a new pull request, #14713:
URL: https://github.com/apache/lucene/pull/14713

   ### Description
   
   
   





Re: [PR] Fix comment above OnHeapHnswGraph#getNeighbors. [lucene]

2025-05-25 Thread via GitHub


github-actions[bot] commented on PR #14713:
URL: https://github.com/apache/lucene/pull/14713#issuecomment-2908433928

   This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. 
If the PR doesn't need a changelog entry, then add the skip-changelog-check 
label to it and you will stop receiving this reminder on future updates to the 
PR.





Re: [I] Remove Telugu normalization of vu వు to ma మ from IndicNormalizer [lucene]

2025-05-25 Thread via GitHub


praveen-d291 commented on issue #14659:
URL: https://github.com/apache/lucene/issues/14659#issuecomment-2907706053

   @rmuir,
   
   You're absolutely right; I should have led with this data in my initial 
comment. My apologies for not providing the "homework" upfront.
   
   Here's a direct look at the state of modern Telugu content, which strongly 
suggests that the issues the IndicNormalizationFilter was designed to address 
are less prevalent now:
   
   1. **Prevalence of Clean Unicode Text**:
   I've analyzed several high-volume, real-world Telugu sources, and the trend 
towards clean Unicode is very clear across these examples:
   
   
   - The official website of the Government of Telangana: 
https://www.telangana.gov.in/te/
   - The Andhra Pradesh Government's Irrigation Department website: 
https://irrigationap.cgg.gov.in/wrd/home
   - The Andhra Pradesh Agriculture Department website: 
https://www.apagrisnet.gov.in/
   - A major Telugu news publication like Eenadu: https://www.eenadu.net/ 
(consistently a top 3 paper by circulation).
   
   All content on these sites consistently uses UTF-8 Unicode. Characters like 
వు (vu) and మ (ma) are rendered distinctly and unambiguously.
   
   2. **Widespread OS-Level Font Support**:
   The need for "custom fonts from websites" or "janky conversion" is largely 
gone because popular OS vendors have been bundling robust Telugu font support 
for over two decades:
   
   - **Windows**: Gautami has been included since 2001 
(https://en.wikipedia.org/wiki/Gautami_(typeface)). Nirmala UI, a comprehensive 
typeface for Indic scripts, has been bundled since Windows 8 
(https://en.wikipedia.org/wiki/Nirmala_UI).
   - **macOS**: macOS Monterey alone includes 15 Telugu fonts (Apple support 
page: https://support.apple.com/en-in/103203).
   
   This widespread, native OS support means that users generally do not have 
to deal with systems that require special handling or struggle with complex 
script rendering for modern Unicode Telugu text.
   
   The core issue is that applying the వు to మ conflation by default now 
introduces a linguistically incorrect loss of precision for the vast majority 
of current Telugu content. Given this, I want to reiterate the two options I 
proposed earlier for addressing this:
   
   Option 1: Fix the Default (My Preference)
   I'd propose adding a boolean option to the TeluguAnalyzer constructor to 
control IndicNormalizationFilter inclusion, and making its default false. This 
would make TeluguAnalyzer precise right out of the box for modern documents. 
Users with older, less-formatted text could still explicitly enable it. I 
believe this is a necessary correction for linguistic accuracy, and it would 
also document this conversion explicitly.
   
   Option 2: Document the behavior in TeluguAnalyzer
   Alternatively, we could document this specific behavior in the 
TeluguAnalyzer docs, explaining the వు to మ mapping and how to build a custom 
analyzer to avoid it.
   
   Option 1 feels like the right long-term fix for the default user experience, 
given the current state of Telugu content. What do you think? I can raise a PR 
once we agree on the approach.
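
To make the precision loss and the opt-in flag from Option 1 concrete, here is a minimal, self-contained sketch. It does not use Lucene and is not the IndicNormalizer implementation: `TeluguConflationSketch`, `normalize` and the `conflateVuToMa` flag are hypothetical names, and the plain string replacement only mimics the వు to మ mapping discussed in this issue. The two sample strings are illustrative character sequences chosen only because they differ in exactly that position.

```java
final class TeluguConflationSketch {

  // "వు": TELUGU LETTER VA (U+0C35) followed by TELUGU VOWEL SIGN U (U+0C41).
  static final String VU = "\u0C35\u0C41";
  // "మ": TELUGU LETTER MA (U+0C2E).
  static final String MA = "\u0C2E";

  // Hypothetical normalization step: the conflation is applied only when requested,
  // mirroring the boolean option proposed for the analyzer constructor.
  static String normalize(String token, boolean conflateVuToMa) {
    return conflateVuToMa ? token.replace(VU, MA) : token;
  }

  public static void main(String[] args) {
    String withVu = "\u0C06" + VU; // "ఆవు"
    String withMa = "\u0C06" + MA; // "ఆమ"

    // With the conflation enabled, both inputs produce the same index term.
    System.out.println(normalize(withVu, true).equals(normalize(withMa, true)));
    // With the conflation disabled, the two inputs stay distinct.
    System.out.println(normalize(withVu, false).equals(normalize(withMa, false)));
  }
}
```

With the flag defaulting to false, distinct inputs remain distinct terms, and users with legacy text could still pass true to get the old behavior.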

