[PR] Support modifying segmentInfos.counter in IndexWriter [lucene]

2025-03-27 Thread via GitHub


guojialiang92 opened a new pull request, #14417:
URL: https://github.com/apache/lucene/pull/14417

   ### Description
   
   
   
   This PR aims to address issue 
[14362](https://github.com/apache/lucene/issues/14362). This issue includes a 
discussion of the benefits of this modification.
   
   ### Tests
   I added test `TestIndexWriter#testAdvanceSegmentInfosCounter`.
   
   ### Checklist
   - [x] I have reviewed the guidelines for [How to 
Contribute](https://github.com/apache/lucene/blob/main/CONTRIBUTING.md) and my 
code conforms to the standards described there to the best of my ability.
   - [x] I have given Lucene maintainers 
[access](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the main branch.
   - [x] I have run ./gradlew check.
   - [x] I have added tests for my changes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Use @snippet javadoc tag for snippets [lucene]

2025-03-27 Thread via GitHub


dweiss commented on issue #14257:
URL: https://github.com/apache/lucene/issues/14257#issuecomment-2756943569

   It'll work with those indented lines as well - it actually will align block 
indentation to the column they should be starting at. So:
   ```
   /// [Collector] for cutter+recorder pair. NOCOMMIT: This is a long line 
comment that should not be broken into smaller chunks.
   ///   * also, the space after /// should be preserved
   ///   * like here.
 public FacetFieldCollector(FacetCutter facetCutter, FacetRecorder 
facetRecorder) {
   this.facetCutter = facetCutter;
   this.facetRecorder = facetRecorder;
 }
   ```
   ends up this after running tidy:
   ```
 /// [Collector] for cutter+recorder pair. NOCOMMIT: This is a long line 
comment that should not be broken into smaller chunks.
 ///   * also, the space after /// should be preserved
 ///   * like here.
 public FacetFieldCollector(FacetCutter facetCutter, FacetRecorder 
facetRecorder) {
   this.facetCutter = facetCutter;
   this.facetRecorder = facetRecorder;
 }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Use @snippet javadoc tag for snippets [lucene]

2025-03-27 Thread via GitHub


dweiss commented on issue #14257:
URL: https://github.com/apache/lucene/issues/14257#issuecomment-2756947753

   [here's the gjf issue I asked for the direction they'd like to 
follow](https://github.com/google/google-java-format/issues/1193)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speedup merging of HNSW graphs [lucene]

2025-03-27 Thread via GitHub


mayya-sharipova commented on code in PR #14331:
URL: https://github.com/apache/lucene/pull/14331#discussion_r2005461835


##
lucene/core/src/java/org/apache/lucene/util/hnsw/ConcurrentHnswMerger.java:
##
@@ -51,19 +57,85 @@ protected HnswBuilder createBuilder(KnnVectorValues 
mergedVectorValues, int maxO
 OnHeapHnswGraph graph;
 BitSet initializedNodes = null;
 
-if (initReader == null) {
+if (graphReaders.size() == 0) {
   graph = new OnHeapHnswGraph(M, maxOrd);
 } else {
+  
graphReaders.sort(Comparator.comparingInt(GraphReader::graphSize).reversed());
+  GraphReader initGraphReader = graphReaders.get(0);
+  KnnVectorsReader initReader = initGraphReader.reader();
+  MergeState.DocMap initDocMap = initGraphReader.initDocMap();
+  int initGraphSize = initGraphReader.graphSize();
   HnswGraph initializerGraph = ((HnswGraphProvider) 
initReader).getGraph(fieldInfo.name);
+
   if (initializerGraph.size() == 0) {
 graph = new OnHeapHnswGraph(M, maxOrd);
   } else {
 initializedNodes = new FixedBitSet(maxOrd);
-int[] oldToNewOrdinalMap = getNewOrdMapping(mergedVectorValues, 
initializedNodes);
+int[] oldToNewOrdinalMap =
+getNewOrdMapping(
+fieldInfo,
+initReader,
+initDocMap,
+initGraphSize,
+mergedVectorValues,
+initializedNodes);
 graph = InitializedHnswGraphBuilder.initGraph(initializerGraph, 
oldToNewOrdinalMap, maxOrd);
   }
 }
 return new HnswConcurrentMergeBuilder(
 taskExecutor, numWorker, scorerSupplier, beamWidth, graph, 
initializedNodes);
   }
+
+  /**
+   * Creates a new mapping from old ordinals to new ordinals and returns the 
total number of vectors
+   * in the newly merged segment.
+   *
+   * @param mergedVectorValues vector values in the merged segment
+   * @param initializedNodes track what nodes have been initialized
+   * @return the mapping from old ordinals to new ordinals
+   * @throws IOException If an error occurs while reading from the merge state
+   */
+  private static final int[] getNewOrdMapping(

Review Comment:
   Addressed in cb852a6387a09ba43049b8a24f1e026c309b368b



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Revert "gh-12627: HnswGraphBuilder connects disconnected HNSW graph components (#13566)" [lucene]

2025-03-27 Thread via GitHub


benwtrent commented on PR #14411:
URL: https://github.com/apache/lucene/pull/14411#issuecomment-2757791544

   @txwei have you seen this behavior in production? I am wondering on the 
urgency.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] HNSW connect components can take an inordinate amount of time [lucene]

2025-03-27 Thread via GitHub


benwtrent commented on issue #14214:
URL: https://github.com/apache/lucene/issues/14214#issuecomment-2757791987

   > Can we expose a graph construction parameter in Lucene99HnswVectorsFormat 
to gate the connectComponents() call? This would allow us to mitigate this 
issue while a more comprehensive fix is developed.
   
   I would argue not. We do not want more knobs to tune. We should instead fix 
the diversity & connection checks to be dynamic (e.g. fills connections when 
distribution of scores is poor, etc.).
   
   As for reverting the change, I am on the fence about it. Maybe that is the 
right thing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Use @snippet javadoc tag for snippets [lucene]

2025-03-27 Thread via GitHub


dweiss commented on issue #14257:
URL: https://github.com/apache/lucene/issues/14257#issuecomment-2757854863

   Let's wait a few days and see if there's any feedback.  Like I mentioned 
above, what we use for formatting/checking format adherence shouldn't really 
matter for anybody who uses gradlew tidy in Lucene - gcf is an ASL licensed 
software and if we find anything problematic, I'll just publish my own fork and 
we can switch to that (and then we have the freedom of doing whatever, 
including applying markdown formatting of our own... like 
https://github.com/commonmark/commonmark-java).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Use @snippet javadoc tag for snippets [lucene]

2025-03-27 Thread via GitHub


rmuir commented on issue #14257:
URL: https://github.com/apache/lucene/issues/14257#issuecomment-2757832416

   awesome! I really hope the patch is accepted: this will definitely give us a 
path forward. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Use @snippet javadoc tag for snippets [lucene]

2025-03-27 Thread via GitHub


rmuir commented on issue #14257:
URL: https://github.com/apache/lucene/issues/14257#issuecomment-2757929583

   yeah: only thing I will say about choice of formatter is that normally I try 
to configure the editor in the repo to match what the build expects. e.g. for 
languages like python and typescript, it means disabling all formatting-related 
diagnostics and enabling "format-on-save" and "organize-imports-on-save", so 
that things "just work".  It makes for easy dev experience where users have 
less friction with CI. 
   
   But with java, such a setup seems difficult to achieve: I'm not sure anyone 
even has that today.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] #14410 - Add Anytime Ranking Searching - SLA-constrained ranking With Range Boosting and Dynamic SLA [lucene]

2025-03-27 Thread via GitHub


atris commented on PR #14409:
URL: https://github.com/apache/lucene/pull/14409#issuecomment-2758008422

   My bad - should have set some more context.
   
   In reference to 
   
   https://github.com/apache/lucene/issues/13675
   
   The paper referred to by Adrien has a component on anytime search - SLA 
based termination coupled with optional docID range based boosting. This PR is 
an implementation of the same


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] #14410 - Add Anytime Ranking Searching - SLA-constrained ranking With Range Boosting and Dynamic SLA [lucene]

2025-03-27 Thread via GitHub


benwtrent commented on PR #14409:
URL: https://github.com/apache/lucene/pull/14409#issuecomment-2758025254

   @atris Ah, thank you, I will take a look at the paper first.
   
   Do you have any benchmarking to replicate the paper's findings within the 
Apache Lucene context?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] #14410 - Add Anytime Ranking Searching - SLA-constrained ranking With Range Boosting and Dynamic SLA [lucene]

2025-03-27 Thread via GitHub


benwtrent commented on PR #14409:
URL: https://github.com/apache/lucene/pull/14409#issuecomment-2757970091

   @atris I am gonna be frank, I haven't a clue what this is doing :D. 
   
   Why is this being added? What is it supposed to accomplish?
   
   Maybe there is context I am missing...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] skip keyword in German Normalization Filter [lucene]

2025-03-27 Thread via GitHub


rmuir commented on PR #14416:
URL: https://github.com/apache/lucene/pull/14416#issuecomment-2757816185

   i think it will introduce a ton more complexity: that's why we've pushed 
back on doing this for anything that isn't a stemmer. otherwise people will 
want LowerCaseFilter to respect it too... and then indexing performance is dead.
   
   If you want to change the behavior of GermanAnalyzer then just make a 
CustomAnalyzer with ProtectedTermsFilter.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add Issue Tracker Link under 'Editing Content on the Lucene™ Sites' [lucene-site]

2025-03-27 Thread via GitHub


dweiss commented on code in PR #78:
URL: https://github.com/apache/lucene-site/pull/78#discussion_r2003502701


##
content/pages/site-instructions.md:
##
@@ -3,8 +3,10 @@ URL: site-instructions.html
 save_as: site-instructions.html
 template: lucene/tlp/page
 
+
 ## Editing Content on the Lucene™ sites
+The web site is hosted in its own git repository **lucene-site** (see 
[Github](https://github.com/apache/lucene-site) and 
[Gitbox](https://gitbox.apache.org/repos/asf?p=lucene-site.git)).
 
-The web site is hosted in its own git repository `lucene-site` (see 
[Github](https://github.com/apache/lucene-site/) and 
[Gitbox](https://gitbox.apache.org/repos/asf/lucene-site.git)).
+Pushing to the `main` branch will update the staging site while pushing to 
`production` branch will update the main web site. Read the `README.md` file 
for further instructions.
 
-Pushing to the `main` branch will update the [staging 
site](https://lucene.staged.apache.org) while pushing to `production` branch 
will update the main web site. Read the 
[README.md](https://github.com/apache/lucene-site/blob/main/README.md) file for 
further instructions.
+For reporting website-related issues or suggesting improvements, please visit 
our [Issue Tracker](https://issues.apache.org/jira/projects/LUCENE).

Review Comment:
   The issue tracker is no longer Jira - it should be github issues on this 
project (lucene-site), perhaps?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]

2025-03-27 Thread via GitHub


jpountz commented on code in PR #14273:
URL: https://github.com/apache/lucene/pull/14273#discussion_r2017546602


##
lucene/core/src/java/org/apache/lucene/search/DISIDocIdStream.java:
##
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import org.apache.lucene.util.FixedBitSet;
+
+final class DISIDocIdStream extends DocIdStream {
+
+  private final DocIdSetIterator iterator;
+  private final int to;
+  private final FixedBitSet spare;
+
+  DISIDocIdStream(DocIdSetIterator iterator, int to, FixedBitSet spare) {

Review Comment:
   I'm leaning towards waiting until there's a need for it before adding it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]

2025-03-27 Thread via GitHub


jpountz commented on code in PR #14273:
URL: https://github.com/apache/lucene/pull/14273#discussion_r2017552582


##
lucene/core/src/java/org/apache/lucene/search/BooleanScorer.java:
##
@@ -207,8 +164,32 @@ private void scoreWindowIntoBitSetAndReplay(
   acceptDocs.applyMask(matching, base);
 }
 
-docIdStreamView.base = base;
-collector.collect(docIdStreamView);
+if (buckets == null) {
+  if (acceptDocs != null) {
+// In this case, live docs have not been applied yet.
+acceptDocs.applyMask(matching, base);
+  }

Review Comment:
   Thanks for catching, this bit of code was mistakenly duplicated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]

2025-03-27 Thread via GitHub


jpountz commented on code in PR #14273:
URL: https://github.com/apache/lucene/pull/14273#discussion_r2017556183


##
lucene/core/src/java/org/apache/lucene/search/BitSetDocIdStream.java:
##
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import org.apache.lucene.util.FixedBitSet;
+
+final class BitSetDocIdStream extends DocIdStream {
+
+  private final FixedBitSet bitSet;
+  private final int offset, max;
+  private int upTo;
+
+  BitSetDocIdStream(FixedBitSet bitSet, int offset) {
+this.bitSet = bitSet;
+this.offset = offset;
+upTo = offset;
+max = (int) Math.min(Integer.MAX_VALUE, (long) offset + bitSet.length());
+  }
+
+  @Override
+  public boolean mayHaveRemaining() {
+return upTo < max;
+  }
+
+  @Override
+  public void forEach(int upTo, CheckedIntConsumer consumer) 
throws IOException {
+if (upTo >= this.upTo) {

Review Comment:
   (For reference, it's not as much for short-circuiting as for the fact that 
logic under the if block would fail if trying to move backwards. But since 
we're doing a check anyway, I agree it's also worth excluding the trivial case 
when the range of docs to collect is empty.)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Preparing existing profiler for adding concurrent profiling [lucene]

2025-03-27 Thread via GitHub


msfroh commented on PR #14413:
URL: https://github.com/apache/lucene/pull/14413#issuecomment-2759357863

   Does it make sense to create a separate `QueryProfilerBreakDown` per leaf? 
Or should it create one per slice? 
   
   Can this be implemented as part of the `CollectorManager#newCollector` logic?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]

2025-03-27 Thread via GitHub


jpountz commented on code in PR #14273:
URL: https://github.com/apache/lucene/pull/14273#discussion_r2017544601


##
lucene/core/src/java/org/apache/lucene/search/DocIdStream.java:
##
@@ -34,12 +33,34 @@ protected DocIdStream() {}
* Iterate over doc IDs contained in this stream in order, calling the given 
{@link
* CheckedIntConsumer} on them. This is a terminal operation.
*/
-  public abstract void forEach(CheckedIntConsumer consumer) 
throws IOException;
+  public void forEach(CheckedIntConsumer consumer) throws 
IOException {
+forEach(DocIdSetIterator.NO_MORE_DOCS, consumer);
+  }
+
+  /**
+   * Iterate over doc IDs contained in this doc ID stream up to the given 
{@code upTo} exclusive,
+   * calling the given {@link CheckedIntConsumer} on them. It is not possible 
to iterate these doc
+   * IDs again later on.
+   */
+  public abstract void forEach(int upTo, CheckedIntConsumer 
consumer)
+  throws IOException;
 
   /** Count the number of entries in this stream. This is a terminal 
operation. */
   public int count() throws IOException {
-int[] count = new int[1];
-forEach(_ -> count[0]++);
-return count[0];
+return count(DocIdSetIterator.NO_MORE_DOCS);
   }
+
+  /**
+   * Count the number of doc IDs in this stream that are below the given 
{@code upTo}. These doc IDs
+   * may not be consumed again later.
+   */
+  // Note: it's abstract rather than having a default impl that delegates to 
#forEach because doing

Review Comment:
   Ah right, it was based on your previous feedback. :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Allow skip cache factor to be updated dynamically [lucene]

2025-03-27 Thread via GitHub


sgup432 commented on code in PR #14412:
URL: https://github.com/apache/lucene/pull/14412#discussion_r2017693759


##
lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java:
##
@@ -122,12 +123,30 @@ public LRUQueryCache(
   long maxRamBytesUsed,
   Predicate leavesToCache,
   float skipCacheFactor) {
+this(maxSize, maxRamBytesUsed, leavesToCache, new 
AtomicReference<>(skipCacheFactor));
+  }
+
+  /**
+   * Additionally, allows the ability to pass skipCacheFactor in form of 
AtomicReference where the
+   * caller can dynamically update(in a thread safe way) its value by calling 
skipCacheFactor.set()
+   * on their end.
+   */
+  public LRUQueryCache(
+  int maxSize,
+  long maxRamBytesUsed,
+  Predicate leavesToCache,
+  AtomicReference skipCacheFactor) {

Review Comment:
   I actually made it a AtomicReference so that a caller can create that on 
their end and pass this reference here. Via which the caller can change its 
value dynamically.
   
   Other way to set skipFactor dynamically would have been getter/setter like 
you mentioned below. But then one cannot use an interface `QueryCache` to 
initialize `LRUQueryCache`, and they are forced to use concrete 
implementation(to get/set) which I think is not right.
   
   ```
   QueryCache lruCache = new LRUQueryCache();
   // Cannot accesss getter/setter until unless I typecast or use concrete 
implementation directly.
   ```
   
   Let me know what you think.



##
lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java:
##
@@ -122,12 +123,30 @@ public LRUQueryCache(
   long maxRamBytesUsed,
   Predicate leavesToCache,
   float skipCacheFactor) {
+this(maxSize, maxRamBytesUsed, leavesToCache, new 
AtomicReference<>(skipCacheFactor));
+  }
+
+  /**
+   * Additionally, allows the ability to pass skipCacheFactor in form of 
AtomicReference where the
+   * caller can dynamically update(in a thread safe way) its value by calling 
skipCacheFactor.set()
+   * on their end.
+   */
+  public LRUQueryCache(
+  int maxSize,
+  long maxRamBytesUsed,
+  Predicate leavesToCache,
+  AtomicReference skipCacheFactor) {

Review Comment:
   I actually made it a AtomicReference so that a caller can create that on 
their end and pass this reference here. Via which the caller can change its 
value dynamically.
   
   Other way to set skipFactor dynamically would have been via getter/setter 
like you mentioned below. But then one cannot use an interface `QueryCache` to 
initialize `LRUQueryCache`, and they are forced to use concrete 
implementation(to get/set) which I think is not right.
   
   ```
   QueryCache lruCache = new LRUQueryCache();
   // Cannot accesss getter/setter until unless I typecast or use concrete 
implementation directly.
   ```
   
   Let me know what you think.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-03-27 Thread via GitHub


gf2121 commented on code in PR #14333:
URL: https://github.com/apache/lucene/pull/14333#discussion_r2006889147


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieBuilder.java:
##
@@ -0,0 +1,552 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene90.blocktree;
+
+import java.io.IOException;
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.Iterator;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.ListIterator;
+import java.util.function.BiConsumer;
+import org.apache.lucene.store.DataOutput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+
+/** TODO make it a more memory efficient structure */
+class TrieBuilder {
+
+  static final int SIGN_NO_CHILDREN = 0x00;
+  static final int SIGN_SINGLE_CHILD_WITH_OUTPUT = 0x01;
+  static final int SIGN_SINGLE_CHILD_WITHOUT_OUTPUT = 0x02;
+  static final int SIGN_MULTI_CHILDREN = 0x03;
+
+  static final int LEAF_NODE_HAS_TERMS = 1 << 5;
+  static final int LEAF_NODE_HAS_FLOOR = 1 << 6;
+  static final long NON_LEAF_NODE_HAS_TERMS = 1L << 1;
+  static final long NON_LEAF_NODE_HAS_FLOOR = 1L << 0;
+
+  /**
+   * The output describing the term block the prefix point to.
+   *
+   * @param fp describes the on-disk terms block which a trie node points to.
+   * @param hasTerms A boolean which will be false if this on-disk block 
consists entirely of
+   * pointers to child blocks.
+   * @param floorData A {@link BytesRef} which will be non-null when a large 
block of terms sharing
+   * a single trie prefix is split into multiple on-disk blocks.
+   */
+  record Output(long fp, boolean hasTerms, BytesRef floorData) {}
+
+  private enum Status {
+BUILDING,
+SAVED,
+DESTROYED
+  }
+
+  private static class Node {
+
+// The utf8 digit that leads to this Node, 0 for root node
+private final int label;
+// The children listed in order by their utf8 label
+private final LinkedList children;
+// The output of this node.
+private Output output;
+
+// Vars used during saving:
+
+// The file pointer point to where the node saved. -1 means the node has 
not been saved.
+private long fp = -1;
+// The iterator whose next() point to the first child has not been saved.
+private Iterator childrenIterator;
+
+Node(int label, Output output, LinkedList children) {
+  this.label = label;
+  this.output = output;
+  this.children = children;
+}
+  }
+
+  private Status status = Status.BUILDING;
+  final Node root = new Node(0, null, new LinkedList<>());
+
+  static TrieBuilder bytesRefToTrie(BytesRef k, Output v) {
+return new TrieBuilder(k, v);
+  }
+
+  private TrieBuilder(BytesRef k, Output v) {
+if (k.length == 0) {
+  root.output = v;
+  return;
+}
+Node parent = root;
+for (int i = 0; i < k.length; i++) {
+  int b = k.bytes[i + k.offset] & 0xFF;
+  Output output = i == k.length - 1 ? v : null;
+  Node node = new Node(b, output, new LinkedList<>());
+  parent.children.add(node);
+  parent = node;
+}
+  }
+
+  /**
+   * Absorb all (K, V) pairs from the given trie into this one. The given trie 
builder should not
+   * have key that already exists in this one, otherwise a {@link 
IllegalArgumentException } will be
+   * thrown and this trie will get destroyed.
+   *
+   * Note: the given trie will be destroyed after absorbing.
+   */
+  void absorb(TrieBuilder trieBuilder) {
+if (status != Status.BUILDING || trieBuilder.status != Status.BUILDING) {
+  throw new IllegalStateException("tries should be unsaved");
+}
+// Use a simple stack to avoid recursion.
+Deque stack = new ArrayDeque<>();
+stack.add(() -> absorb(this.root, trieBuilder.root, stack));
+while (!stack.isEmpty()) {
+  stack.pop().run();
+}
+trieBuilder.status = Status.DESTROYED;
+  }
+
+  private void absorb(Node n, Node add, Deque stack) {
+assert n.label == add.label;
+if (add.output != null) {
+  if (n.output != null) {
+ 

Re: [PR] Enable collectors to take advantage of pre-aggregated data. [lucene]

2025-03-27 Thread via GitHub


jpountz commented on PR #14401:
URL: https://github.com/apache/lucene/pull/14401#issuecomment-2759494063

   > This would also benefit https://github.com/apache/lucene/pull/14273
   
   I don't think so, or rather taking advantage of range collection shouldn't 
help more than what #14273 does with `RangeDocIdStream`?
   
   For clarity, this `collectRange` method is more useful for aggregating 
values (facet "recorders" if I use the terminology of the new faceting 
framework?). Implementations would need to consult the `DocValuesSkipper` to 
check if it has pre-aggregated data over ranges of doc IDs that are contained 
in the range of doc IDs passed to `collectRange`. These sub ranges could be 
aggregated in constant time, without having to iterate over docs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Allow skip cache factor to be updated dynamically [lucene]

2025-03-27 Thread via GitHub


jpountz commented on code in PR #14412:
URL: https://github.com/apache/lucene/pull/14412#discussion_r2017616847


##
lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java:
##
@@ -99,7 +100,7 @@ public class LRUQueryCache implements QueryCache, 
Accountable {
   private final Map cache;
   private final ReentrantReadWriteLock.ReadLock readLock;
   private final ReentrantReadWriteLock.WriteLock writeLock;
-  private final float skipCacheFactor;
+  private final AtomicReference skipCacheFactor;

Review Comment:
   Making it volatile should be enough?



##
lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java:
##
@@ -122,12 +123,30 @@ public LRUQueryCache(
   long maxRamBytesUsed,
   Predicate leavesToCache,
   float skipCacheFactor) {
+this(maxSize, maxRamBytesUsed, leavesToCache, new 
AtomicReference<>(skipCacheFactor));
+  }
+
+  /**
+   * Additionally, allows the ability to pass skipCacheFactor in form of 
AtomicReference where the
+   * caller can dynamically update(in a thread safe way) its value by calling 
skipCacheFactor.set()
+   * on their end.
+   */
+  public LRUQueryCache(
+  int maxSize,
+  long maxRamBytesUsed,
+  Predicate leavesToCache,
+  AtomicReference skipCacheFactor) {

Review Comment:
   Let's avoid exposing the implementation detail of how the skip cache factor 
is stored internally, ie. skipCacheFactor should be a regular `float` here?



##
lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java:
##
@@ -142,6 +161,10 @@ public LRUQueryCache(
 missCount = new LongAdder();
   }
 
+  AtomicReference getSkipCacheFactor() {
+return skipCacheFactor;
+  }

Review Comment:
   Can you add a regular getter/setter instead?



##
lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java:
##
@@ -269,8 +292,8 @@ boolean requiresEviction() {
   }
 
   CacheAndCount get(Query key, IndexReader.CacheHelper cacheHelper) {
-assert key instanceof BoostQuery == false;
-assert key instanceof ConstantScoreQuery == false;
+assert !(key instanceof BoostQuery);
+assert !(key instanceof ConstantScoreQuery);

Review Comment:
   Please undo these changes, these `== false` are on purpose (you'll find many 
of them across the code base) to be more readable.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Leverage sparse doc value indexes for range and value facet collection [lucene]

2025-03-27 Thread via GitHub


jpountz commented on issue #14406:
URL: https://github.com/apache/lucene/issues/14406#issuecomment-2759414888

   Thanks for the explanation, that makes sense.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Revert "gh-12627: HnswGraphBuilder connects disconnected HNSW graph components (#13566)" [lucene]

2025-03-27 Thread via GitHub


txwei commented on PR #14411:
URL: https://github.com/apache/lucene/pull/14411#issuecomment-2759555261

   @benwtrent we haven't released this to prod yet. We spotted a severe perf 
regression with the degenerate case of all vectors being the same. Since we 
have enough customers using all kinds of embeddings in production, this is very 
likely to be an issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Preparing existing profiler for adding concurrent profiling [lucene]

2025-03-27 Thread via GitHub


jpountz commented on PR #14413:
URL: https://github.com/apache/lucene/pull/14413#issuecomment-2759377415

   @jainankitk OK. In my opinion, it's more important to handle the concurrent 
and non-concurrent cases consistently than to save some overhead when searches 
are not concurrent. I'd really like non-concurrent search to look and feel like 
a concurrent search with a single slice running on a SameThreadExecutorService 
as far as profiling is concerned. So I wouldn't specialize the class hierarchy 
for concurrent vs. non-concurrent.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]

2025-03-27 Thread via GitHub


gsmiller commented on code in PR #14273:
URL: https://github.com/apache/lucene/pull/14273#discussion_r2017721363


##
lucene/core/src/java/org/apache/lucene/util/FixedBitSet.java:
##
@@ -204,6 +205,40 @@ public int cardinality() {
 return Math.toIntExact(tot);
   }
 
+  /**
+   * Return the number of set bits between indexes {@code from} inclusive and 
{@code to} exclusive.
+   */
+  public int cardinality(int from, int to) {
+Objects.checkFromToIndex(from, to, length());
+
+int cardinality = 0;
+
+// First, align `from` with a word start, ie. a multiple of Long.SIZE (64)
+if ((from & 0x3F) != 0) {
+  long bits = this.bits[from >> 6] >>> from;
+  int numBitsTilNextWord = -from & 0x3F;
+  if (to - from < numBitsTilNextWord) {
+bits &= (1L << (to - from)) - 1L;
+return Long.bitCount(bits);
+  }
+  cardinality += Long.bitCount(bits);
+  from += numBitsTilNextWord;
+  assert (from & 0x3F) == 0;
+}
+
+for (int i = from >> 6, end = to >> 6; i < end; ++i) {
+  cardinality += Long.bitCount(bits[i]);
+}
+
+// Now handle bits between the last complete word and to
+if ((to & 0x3F) != 0) {
+  long bits = this.bits[to >> 6] << -to;

Review Comment:
   minor: one other small comment. I noticed in the new `forEach` method you 
added you use `long bits = this.bits[to >> 6] & ((1L << to) - 1);`. Is the 
rationale here that you need to persist the correct number of trailing zeros in 
the `forEach` implementation but not in this case since you're doing a bit 
count? Is this approach (shifting by `-to`) more performant (I ask since you 
could use the same approach as `forEach` here too for consistency, so I assume 
you had a reason ;))



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Avoid reload block when seeking backward in SegmentTermsEnum. [lucene]

2025-03-27 Thread via GitHub


github-actions[bot] commented on PR #13253:
URL: https://github.com/apache/lucene/pull/13253#issuecomment-2759878401

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]

2025-03-27 Thread via GitHub


gsmiller commented on code in PR #14273:
URL: https://github.com/apache/lucene/pull/14273#discussion_r2017767190


##
lucene/core/src/java/org/apache/lucene/search/DISIDocIdStream.java:
##
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import org.apache.lucene.util.FixedBitSet;
+
+final class DISIDocIdStream extends DocIdStream {
+
+  private final DocIdSetIterator iterator;
+  private final int to;
+  private final FixedBitSet spare;
+
+  DISIDocIdStream(DocIdSetIterator iterator, int to, FixedBitSet spare) {

Review Comment:
   I think that's fair. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Support modifying segmentInfos.counter in IndexWriter [lucene]

2025-03-27 Thread via GitHub


guojialiang92 commented on code in PR #14417:
URL: https://github.com/apache/lucene/pull/14417#discussion_r2015958210


##
lucene/core/src/test/org/apache/lucene/index/TestIndexWriter.java:
##
@@ -5037,4 +5037,47 @@ public void testDocValuesSkippingIndexWithoutDocValues() 
throws Exception {
   }
 }
   }
+
+  public void testAdvanceSegmentInfosCounter() throws IOException {
+Directory dir = newDirectory();
+
+IndexWriter writer;
+IndexReader reader;
+
+writer = new IndexWriter(dir, newIndexWriterConfig(new 
MockAnalyzer(random(;
+
+// add 10 documents
+for (int i = 0; i < 10; i++) {
+  addDocWithIndex(writer, i);
+  writer.commit();
+}
+long beforeAdvanceSegmentCounter = writer.getSegmentInfosCounter();
+writer.advanceSegmentInfosCounter(1);
+assertEquals(beforeAdvanceSegmentCounter, writer.getSegmentInfosCounter());

Review Comment:
   I have modified.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Support modifying segmentInfos.counter in IndexWriter [lucene]

2025-03-27 Thread via GitHub


guojialiang92 commented on code in PR #14417:
URL: https://github.com/apache/lucene/pull/14417#discussion_r2015951779


##
lucene/core/src/java/org/apache/lucene/index/IndexWriter.java:
##
@@ -1427,6 +1427,24 @@ public synchronized void advanceSegmentInfosVersion(long 
newVersion) {
 changed();
   }
 
+  /**
+   * If {@link SegmentInfos#counter} is below {@code newCounter} then update 
it to this value.
+   *
+   * @lucene.internal
+   */
+  public synchronized void advanceSegmentInfosCounter(long newCounter) {
+this.ensureOpen();
+if (segmentInfos.counter < newCounter) {
+  segmentInfos.counter = newCounter;
+}
+  }
+
+  /** Returns the {@link SegmentInfos#counter}. */
+  public synchronized long getSegmentInfosCounter() {

Review Comment:
   I have removed the synchronization to avoid affecting performance when it is 
heavily called in some scenarios.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Support modifying segmentInfos.counter in IndexWriter [lucene]

2025-03-27 Thread via GitHub


guojialiang92 commented on code in PR #14417:
URL: https://github.com/apache/lucene/pull/14417#discussion_r2015954336


##
lucene/core/src/java/org/apache/lucene/index/IndexWriter.java:
##
@@ -1427,6 +1427,24 @@ public synchronized void advanceSegmentInfosVersion(long 
newVersion) {
 changed();
   }
 
+  /**
+   * If {@link SegmentInfos#counter} is below {@code newCounter} then update 
it to this value.
+   *
+   * @lucene.internal
+   */
+  public synchronized void advanceSegmentInfosCounter(long newCounter) {

Review Comment:
   From the `IndexWriter#changed` interface definition, it should indeed be 
called.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Support modifying segmentInfos.counter in IndexWriter [lucene]

2025-03-27 Thread via GitHub


guojialiang92 commented on code in PR #14417:
URL: https://github.com/apache/lucene/pull/14417#discussion_r2015957918


##
lucene/core/src/test/org/apache/lucene/index/TestIndexWriter.java:
##
@@ -5037,4 +5037,47 @@ public void testDocValuesSkippingIndexWithoutDocValues() 
throws Exception {
   }
 }
   }
+
+  public void testAdvanceSegmentInfosCounter() throws IOException {
+Directory dir = newDirectory();
+
+IndexWriter writer;
+IndexReader reader;
+
+writer = new IndexWriter(dir, newIndexWriterConfig(new 
MockAnalyzer(random(;
+
+// add 10 documents
+for (int i = 0; i < 10; i++) {
+  addDocWithIndex(writer, i);
+  writer.commit();
+}
+long beforeAdvanceSegmentCounter = writer.getSegmentInfosCounter();
+writer.advanceSegmentInfosCounter(1);
+assertEquals(beforeAdvanceSegmentCounter, writer.getSegmentInfosCounter());
+
+writer.advanceSegmentInfosCounter(1000);
+assertEquals(1000, writer.getSegmentInfosCounter());

Review Comment:
   I have modified.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] #14410 - Add Anytime Ranking Searching - SLA-constrained ranking With Range Boosting and Dynamic SLA [lucene]

2025-03-27 Thread via GitHub


atris commented on PR #14409:
URL: https://github.com/apache/lucene/pull/14409#issuecomment-2756755022

   @jpountz @benwtrent requesting your review please 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Support modifying segmentInfos.counter in IndexWriter [lucene]

2025-03-27 Thread via GitHub


guojialiang92 commented on PR #14417:
URL: https://github.com/apache/lucene/pull/14417#issuecomment-2757291661

   Thanks for helping with the code review, I have made modifications according 
to the suggestions


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Support modifying segmentInfos.counter in IndexWriter [lucene]

2025-03-27 Thread via GitHub


guojialiang92 commented on code in PR #14417:
URL: https://github.com/apache/lucene/pull/14417#discussion_r2015872376


##
lucene/core/src/test/org/apache/lucene/index/TestIndexWriter.java:
##
@@ -5037,4 +5037,45 @@ public void testDocValuesSkippingIndexWithoutDocValues() 
throws Exception {
   }
 }
   }
+
+  public void testAdvanceSegmentInfosCounter() throws IOException {
+Directory dir = newDirectory();
+
+IndexWriter writer;
+IndexReader reader;
+
+writer = new IndexWriter(dir, newIndexWriterConfig(new 
MockAnalyzer(random(;
+
+// add 10 documents
+for (int i = 0; i < 10; i++) {
+  addDocWithIndex(writer, i);
+  writer.commit();
+}
+long beforeAdvanceSegmentCounter = writer.getSegmentInfosCounter();
+writer.advanceSegmentInfosCounter(1);
+assertEquals(beforeAdvanceSegmentCounter, writer.getSegmentInfosCounter());
+
+writer.advanceSegmentInfosCounter(1000);
+assertEquals(1000, writer.getSegmentInfosCounter());
+
+// add 40 documents
+for (int i = 10; i < 50; i++) {
+  addDocWithIndex(writer, i);
+  writer.commit();
+}
+
+assertEquals(1041, writer.getSegmentInfosCounter());

Review Comment:
   You're right, I have already made modifications.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] skip keyword in German Normalization Filter [lucene]

2025-03-27 Thread via GitHub


xzhang9292 commented on PR #14416:
URL: https://github.com/apache/lucene/pull/14416#issuecomment-2756971573

   > This keyword is legacy, for stemmers not normalizers. Just use 
ProtectedTermFilter which works with any tokenfilter without requiring 
modification to its code?
   
   @rmuir Thank you for your comment. Yes, using ProtectedTermFilter could be 
an option when directly using German Normalization Filter. Our problem comes 
when we are using German Analyzer. The GermanNormalizer positioned before 
GermanStemmers, so even we put word like "Bär" in exclusion set, normalizer can 
still change it to "bar". I think when people put a word in exclusion set, they 
probably also want the normalizer to keep the word. Would it be beneficial that 
we let Normalizer also exclude keywords?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Support modifying segmentInfos.counter in IndexWriter [lucene]

2025-03-27 Thread via GitHub


vigyasharma commented on issue #14362:
URL: https://github.com/apache/lucene/issues/14362#issuecomment-2756963298

   Sounds good to me. Will watch out for your PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Support modifying segmentInfos.counter in IndexWriter [lucene]

2025-03-27 Thread via GitHub


vigyasharma commented on code in PR #14417:
URL: https://github.com/apache/lucene/pull/14417#discussion_r2015817265


##
lucene/core/src/test/org/apache/lucene/index/TestIndexWriter.java:
##
@@ -5037,4 +5037,45 @@ public void testDocValuesSkippingIndexWithoutDocValues() 
throws Exception {
   }
 }
   }
+
+  public void testAdvanceSegmentInfosCounter() throws IOException {
+Directory dir = newDirectory();
+
+IndexWriter writer;
+IndexReader reader;
+
+writer = new IndexWriter(dir, newIndexWriterConfig(new 
MockAnalyzer(random(;
+
+// add 10 documents
+for (int i = 0; i < 10; i++) {
+  addDocWithIndex(writer, i);
+  writer.commit();
+}
+long beforeAdvanceSegmentCounter = writer.getSegmentInfosCounter();
+writer.advanceSegmentInfosCounter(1);
+assertEquals(beforeAdvanceSegmentCounter, writer.getSegmentInfosCounter());
+
+writer.advanceSegmentInfosCounter(1000);
+assertEquals(1000, writer.getSegmentInfosCounter());
+
+// add 40 documents
+for (int i = 10; i < 50; i++) {
+  addDocWithIndex(writer, i);
+  writer.commit();
+}
+
+assertEquals(1041, writer.getSegmentInfosCounter());

Review Comment:
   I think it's sufficient to assert that the counter value is `>=1000` i.e. 
higher than what you advanced it to. `SegmentInfos.counter` is public and gets 
incremented each time a segment is named, including background merges. 
Asserting on an exact value can make the test flaky.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Support modifying segmentInfos.counter in IndexWriter [lucene]

2025-03-27 Thread via GitHub


vigyasharma commented on code in PR #14417:
URL: https://github.com/apache/lucene/pull/14417#discussion_r2015825615


##
lucene/core/src/java/org/apache/lucene/index/IndexWriter.java:
##
@@ -1427,6 +1427,24 @@ public synchronized void advanceSegmentInfosVersion(long 
newVersion) {
 changed();
   }
 
+  /**
+   * If {@link SegmentInfos#counter} is below {@code newCounter} then update 
it to this value.
+   *
+   * @lucene.internal
+   */
+  public synchronized void advanceSegmentInfosCounter(long newCounter) {

Review Comment:
   Do we need to call `changed()` or increment change count here?



##
lucene/core/src/java/org/apache/lucene/index/IndexWriter.java:
##
@@ -1427,6 +1427,24 @@ public synchronized void advanceSegmentInfosVersion(long 
newVersion) {
 changed();
   }
 
+  /**
+   * If {@link SegmentInfos#counter} is below {@code newCounter} then update 
it to this value.
+   *
+   * @lucene.internal
+   */
+  public synchronized void advanceSegmentInfosCounter(long newCounter) {
+this.ensureOpen();
+if (segmentInfos.counter < newCounter) {
+  segmentInfos.counter = newCounter;
+}
+  }
+
+  /** Returns the {@link SegmentInfos#counter}. */
+  public synchronized long getSegmentInfosCounter() {

Review Comment:
   Do we really need this synchronized?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Support modifying segmentInfos.counter in IndexWriter [lucene]

2025-03-27 Thread via GitHub


vigyasharma commented on code in PR #14417:
URL: https://github.com/apache/lucene/pull/14417#discussion_r2015832195


##
lucene/core/src/test/org/apache/lucene/index/TestIndexWriter.java:
##
@@ -5037,4 +5037,47 @@ public void testDocValuesSkippingIndexWithoutDocValues() 
throws Exception {
   }
 }
   }
+
+  public void testAdvanceSegmentInfosCounter() throws IOException {
+Directory dir = newDirectory();
+
+IndexWriter writer;
+IndexReader reader;
+
+writer = new IndexWriter(dir, newIndexWriterConfig(new 
MockAnalyzer(random(;
+
+// add 10 documents
+for (int i = 0; i < 10; i++) {
+  addDocWithIndex(writer, i);
+  writer.commit();
+}
+long beforeAdvanceSegmentCounter = writer.getSegmentInfosCounter();
+writer.advanceSegmentInfosCounter(1);
+assertEquals(beforeAdvanceSegmentCounter, writer.getSegmentInfosCounter());
+
+writer.advanceSegmentInfosCounter(1000);
+assertEquals(1000, writer.getSegmentInfosCounter());

Review Comment:
   Do we need this check? Let's assert with `gte` to avoid flakiness.



##
lucene/core/src/test/org/apache/lucene/index/TestIndexWriter.java:
##
@@ -5037,4 +5037,47 @@ public void testDocValuesSkippingIndexWithoutDocValues() 
throws Exception {
   }
 }
   }
+
+  public void testAdvanceSegmentInfosCounter() throws IOException {
+Directory dir = newDirectory();
+
+IndexWriter writer;
+IndexReader reader;
+
+writer = new IndexWriter(dir, newIndexWriterConfig(new 
MockAnalyzer(random(;
+
+// add 10 documents
+for (int i = 0; i < 10; i++) {
+  addDocWithIndex(writer, i);
+  writer.commit();
+}
+long beforeAdvanceSegmentCounter = writer.getSegmentInfosCounter();
+writer.advanceSegmentInfosCounter(1);
+assertEquals(beforeAdvanceSegmentCounter, writer.getSegmentInfosCounter());

Review Comment:
   Let's add a similar `gte` check here as in the assert below, instead of 
assuming that counter value is 1.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]

2025-03-27 Thread via GitHub


jpountz commented on PR #14273:
URL: https://github.com/apache/lucene/pull/14273#issuecomment-2758283399

   I played with the geonames dataset, by filtering out docs that don't have a 
value for the `elevation` field (2.3M docs left), enabling index sorting on the 
`elevation` field and computing histograms on the `elevation` field with a 
bucket width of 100.
   
   | Query |  | Latency on main (ms) | Latency on branch (ms) |
   | - | - | - | - |
   | `MatchAllDocsQuery` | Uses `RangeDocIdStream` under the hood | 6.9 | 4.3 |
   | `featureClass:(S P)` | Matches spots or cities, 1.2 matching docs, uses 
`BitSetDocIdStream` under the hood | 4.8 | 2.4 |
   
   I also checked wikibigall, no slowdowns were detected.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Use @snippet javadoc tag for snippets [lucene]

2025-03-27 Thread via GitHub


dweiss commented on issue #14257:
URL: https://github.com/apache/lucene/issues/14257#issuecomment-2758276309

   You're right - this may be a problem. I use intellij and even gjf there does 
not work exactly the same way 
(https://github.com/google/google-java-format/pull/1165). I stopped caring 
after a while. Just do whatever I need, then run a tidy before committing.  I 
know - not ideal. But at least a workable solution.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Make PointValues.intersect iterative instead of recursive [lucene]

2025-03-27 Thread via GitHub


iverase commented on code in PR #14391:
URL: https://github.com/apache/lucene/pull/14391#discussion_r2009830096


##
lucene/core/src/java/org/apache/lucene/index/PointValues.java:
##
@@ -351,35 +351,32 @@ public final void intersect(IntersectVisitor visitor) 
throws IOException {
 assert pointTree.moveToParent() == false;
   }
 
-  private void intersect(IntersectVisitor visitor, PointTree pointTree) throws 
IOException {
-Relation r = visitor.compare(pointTree.getMinPackedValue(), 
pointTree.getMaxPackedValue());
-switch (r) {
-  case CELL_OUTSIDE_QUERY:
-// This cell is fully outside the query shape: stop recursing
-break;
-  case CELL_INSIDE_QUERY:
+  private static void intersect(IntersectVisitor visitor, PointTree pointTree) 
throws IOException {
+int depth = 0;

Review Comment:
   I don' think we need this variable, what do you think about this: 
   ```
 public final void intersect(IntersectVisitor visitor) throws IOException {
   final PointTree pointTree = getPointTree();
   while (true) {
 Relation compare =
 visitor.compare(pointTree.getMinPackedValue(), 
pointTree.getMaxPackedValue());
 if (compare == Relation.CELL_INSIDE_QUERY) {
   // This cell is fully inside the query shape: recursively add all 
points in this cell
   // without filtering
   pointTree.visitDocIDs(visitor);
 } else if (compare == Relation.CELL_CROSSES_QUERY) {
   // The cell crosses the shape boundary, or the cell fully contains 
the query, so we fall
   // through and do full filtering:
   if (pointTree.moveToChild()) {
 continue;
   }
   // TODO: we can assert that the first value here in fact matches 
what the pointTree
   // claimed?
   // Leaf node; scan and filter all points in this block:
   pointTree.visitDocValues(visitor);
 }
 while (pointTree.moveToSibling() == false) {
   if (pointTree.moveToParent() == false) {
 // we are at the root, finish.
 return;
   }
 }
   }
 }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] #14410 - Add Anytime Ranking Searching - SLA-constrained ranking With Range Boosting and Dynamic SLA [lucene]

2025-03-27 Thread via GitHub


jpountz commented on PR #14409:
URL: https://github.com/apache/lucene/pull/14409#issuecomment-2758339188

   I don't think that the timout support that you are introducing buys anything 
compared with the existing timeout support via `IndexSearcher#setTimeout` and 
`TimeLimitingBulkScorer`. To me the interesting part is the ability to estimate 
the maximum possible score for various ranges of the doc ID space (there is an 
assumption that something like recursive graph bisection is enabled at index 
time) so that they can be searched first in order to help dynamic pruning, but 
it looks like this bit is left to the user in your PR. It would be interesting 
to look into doing this in Lucene by looking at impact blocks 
(`Scorer#advanceShallow`, `Scorer#getMaxScore`) for the various disjunctive 
clauses.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Preparing existing profiler for adding concurrent profiling [lucene]

2025-03-27 Thread via GitHub


jainankitk commented on PR #14413:
URL: https://github.com/apache/lucene/pull/14413#issuecomment-2758741304

   > Can you explain why we need two impls? I would have assumed that the 
ConcurrentQueryProfilerBreakdown could also be used for searches that are not 
concurrent?
   
   `ConcurrentQueryProfilerBreakdown` maintains separate instance of 
`QueryProfileBreakdown` for each segment. In the context method, 
`ConcurrentQueryProfilerBreakdown` returns corresponding 
`QueryProfileBreakdown` object for each segment. 
   
   ```
 @Override
 public QueryProfilerBreakdown context(LeafReaderContext context) {
   return this;
 }
   ```
   
   Hence, I felt we don't need to take the overhead of creating breakdown per 
segment and then merging it together during response. That being said, 
eventually we can keep just `ConcurrentQueryProfilerBreakdown` if we prefer 
that for simplicity.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Enable collectors to take advantage of pre-aggregated data. [lucene]

2025-03-27 Thread via GitHub


gsmiller commented on PR #14401:
URL: https://github.com/apache/lucene/pull/14401#issuecomment-2758696779

   It makes sense to me to expose the idea of doc range collection as a 
first-class API on leaf collectors for the reasons you outlined above. This 
would also benefit #14273 as well right? If there are scorers that can leverage 
the range collection call, it would immediately benefit I believe.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] skip keyword in German Normalization Filter [lucene]

2025-03-27 Thread via GitHub


xzhang9292 closed pull request #14416: skip keyword in German Normalization 
Filter
URL: https://github.com/apache/lucene/pull/14416


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Allow skip cache factor to be updated dynamically [lucene]

2025-03-27 Thread via GitHub


sgup432 commented on code in PR #14412:
URL: https://github.com/apache/lucene/pull/14412#discussion_r2017693759


##
lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java:
##
@@ -122,12 +123,30 @@ public LRUQueryCache(
   long maxRamBytesUsed,
   Predicate leavesToCache,
   float skipCacheFactor) {
+this(maxSize, maxRamBytesUsed, leavesToCache, new 
AtomicReference<>(skipCacheFactor));
+  }
+
+  /**
+   * Additionally, allows the ability to pass skipCacheFactor in form of 
AtomicReference where the
+   * caller can dynamically update(in a thread safe way) its value by calling 
skipCacheFactor.set()
+   * on their end.
+   */
+  public LRUQueryCache(
+  int maxSize,
+  long maxRamBytesUsed,
+  Predicate leavesToCache,
+  AtomicReference skipCacheFactor) {

Review Comment:
   I actually made it a AtomicReference so that a caller can create that on 
their end and pass this reference here. Via which the caller can change its 
value dynamically.
   
   Other way to set it dynamically would have been getter/setter like you 
mentioned below. But then one cannot use an interface `QueryCache` to 
initialize `LRUQueryCache`, and they are forced to use concrete 
implementation(to get/set) which I think is not right.
   
   ```
   QueryCache lruCache = new LRUQueryCache();
   // Cannot accesss getter/setter until unless I typecast or use concrete 
implementation directly.
   ```
   
   Let me know what you think.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] quick exit on filter query matching no docs when rewriting knn query [lucene]

2025-03-27 Thread via GitHub


bugmakerr opened a new pull request, #14418:
URL: https://github.com/apache/lucene/pull/14418

   ### Description
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix TestHnswByteVectorGraph.testBuildingJoinSet [lucene]

2025-03-27 Thread via GitHub


mayya-sharipova merged PR #14398:
URL: https://github.com/apache/lucene/pull/14398


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Preparing existing profiler for adding concurrent profiling [lucene]

2025-03-27 Thread via GitHub


jainankitk commented on PR #14413:
URL: https://github.com/apache/lucene/pull/14413#issuecomment-2759703236

   > Does it make sense to create a separate QueryProfilerBreakDown per leaf? 
Or should it create one per slice?
   
   Actually, create one per slice makes lot of sense.
   
   > Can this be implemented as part of the 
ProfilerCollectorManager#newCollector logic? Maybe not, since we would also 
need to profile the work done by the Weight + Scorer on each slice
   
   We can always use `ThreadId` for correctly tracking the mapping from slice 
to `QueryProfilerBreakdown` object.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Preparing existing profiler for adding concurrent profiling [lucene]

2025-03-27 Thread via GitHub


jainankitk commented on PR #14413:
URL: https://github.com/apache/lucene/pull/14413#issuecomment-2759709664

   > In my opinion, it's more important to handle the concurrent and 
non-concurrent cases consistently than to save some overhead when searches are 
not concurrent. I'd really like non-concurrent search to look and feel like a 
concurrent search with a single slice running on a SameThreadExecutorService as 
far as profiling is concerned.
   
   Let me try and see if we can maintain per slice `QueryProfilerBreakdown` 
object. With that, both concurrent and non-concurrent paths would be consistent 
as well as efficient.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]

2025-03-27 Thread via GitHub


gsmiller commented on code in PR #14273:
URL: https://github.com/apache/lucene/pull/14273#discussion_r2017239514


##
lucene/core/src/java/org/apache/lucene/search/DISIDocIdStream.java:
##
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import org.apache.lucene.util.FixedBitSet;
+
+final class DISIDocIdStream extends DocIdStream {
+
+  private final DocIdSetIterator iterator;
+  private final int to;

Review Comment:
   minor: maybe `max` to be consistent with the other implementations?



##
lucene/core/src/java/org/apache/lucene/search/DocIdStream.java:
##
@@ -34,12 +33,34 @@ protected DocIdStream() {}
* Iterate over doc IDs contained in this stream in order, calling the given 
{@link
* CheckedIntConsumer} on them. This is a terminal operation.
*/
-  public abstract void forEach(CheckedIntConsumer consumer) 
throws IOException;
+  public void forEach(CheckedIntConsumer consumer) throws 
IOException {
+forEach(DocIdSetIterator.NO_MORE_DOCS, consumer);
+  }
+
+  /**
+   * Iterate over doc IDs contained in this doc ID stream up to the given 
{@code upTo} exclusive,
+   * calling the given {@link CheckedIntConsumer} on them. It is not possible 
to iterate these doc
+   * IDs again later on.
+   */
+  public abstract void forEach(int upTo, CheckedIntConsumer 
consumer)
+  throws IOException;
 
   /** Count the number of entries in this stream. This is a terminal 
operation. */
   public int count() throws IOException {
-int[] count = new int[1];
-forEach(_ -> count[0]++);
-return count[0];
+return count(DocIdSetIterator.NO_MORE_DOCS);
   }
+
+  /**
+   * Count the number of doc IDs in this stream that are below the given 
{@code upTo}. These doc IDs
+   * may not be consumed again later.
+   */
+  // Note: it's abstract rather than having a default impl that delegates to 
#forEach because doing

Review Comment:
   Thanks for adding this comment! +1 to the rationale as well.



##
lucene/core/src/java/org/apache/lucene/search/DISIDocIdStream.java:
##
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import org.apache.lucene.util.FixedBitSet;
+
+final class DISIDocIdStream extends DocIdStream {
+
+  private final DocIdSetIterator iterator;
+  private final int to;
+  private final FixedBitSet spare;
+
+  DISIDocIdStream(DocIdSetIterator iterator, int to, FixedBitSet spare) {

Review Comment:
   Would it make sense to offer another ctor that takes care of allocating 
`spare`? I suppose there's not actually a use-case for it right now since the 
one calling code point of this currently has a bit set that can be reused, so 
maybe not worth adding until there's an actual reason.



##
lucene/core/src/java/org/apache/lucene/util/FixedBitSet.java:
##
@@ -204,6 +205,39 @@ public int cardinality() {
 return Math.toIntExact(tot);
   }
 
+  /**
+   * Return the number of set bits between indexes {@code from} inclusive and 
{@code to} exclusive.
+   */
+  public int cardinality(int from, int to) {
+Objects.checkFromToIndex(from, to, length());
+
+int cardinality = 0;
+
+// First, align `from` with a word start, ie. a multiple of Long.SIZE (64)
+if ((from & 0x3F) != 0) {
+  long bits = this.bits[from >> 6] >>> from;
+  int

Re: [I] Leverage sparse doc value indexes for range and value facet collection [lucene]

2025-03-27 Thread via GitHub


gsmiller commented on issue #14406:
URL: https://github.com/apache/lucene/issues/14406#issuecomment-2759371300

   > Out of curiosity, is it common for the union of the configured ranges to 
only match a small subset of the index? I would naively expect users to want to 
collect stats about all their data, so there would be one open-ended range as a 
"case else" and such an optimization would never kick in in practice?
   
   Agreed it seems rare/unlikely but I have one such example of this we 
currently do in Amazon's product search engine. We support a "Customer Reviews" 
filter today that allows customers to filter by a product's "star rating", but 
the only option we currently expose to customers is "4 stars & up" (while star 
ratings for products run from 1 - 5 stars). We leverage faceting to compute 
aggregations over products with a review value of `4-`, and don't care about 
docs with < 4 stars (for the purpose of faceting).
   
   The problem with this example is that we're also faceting on other fields 
for the same match set, so even if we exposed a competitive iterator for the 
reviews case, it wouldn't actually let us skip anything in practice (at least 
not in the sandbox faceting module where we're computing many aggregations 
while collecting). But I suppose our traditional faceting implementation could 
leverage a sparse index directly to potentially skip over some blocks of 
documents.
   
   Agreed it's an unlikely optimization to have much value in practice :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]

2025-03-27 Thread via GitHub


jpountz commented on code in PR #14273:
URL: https://github.com/apache/lucene/pull/14273#discussion_r2017547274


##
lucene/core/src/java/org/apache/lucene/search/BitSetDocIdStream.java:
##
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import org.apache.lucene.util.FixedBitSet;
+
+final class BitSetDocIdStream extends DocIdStream {
+
+  private final FixedBitSet bitSet;
+  private final int offset, max;
+  private int upTo;
+
+  BitSetDocIdStream(FixedBitSet bitSet, int offset) {
+this.bitSet = bitSet;
+this.offset = offset;
+upTo = offset;
+max = (int) Math.min(Integer.MAX_VALUE, (long) offset + bitSet.length());
+  }
+
+  @Override
+  public boolean mayHaveRemaining() {
+return upTo < max;
+  }
+
+  @Override
+  public void forEach(int upTo, CheckedIntConsumer consumer) 
throws IOException {
+if (upTo >= this.upTo) {

Review Comment:
   You're right, I'll change this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Case-insensitive TermInSetQuery Implementation (Proof of Concept) [lucene]

2025-03-27 Thread via GitHub


github-actions[bot] commented on PR #14349:
URL: https://github.com/apache/lucene/pull/14349#issuecomment-2759877032

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Reduce the number of comparisons when lowerPoint is equal to upperPoint [lucene]

2025-03-27 Thread via GitHub


github-actions[bot] commented on PR #14267:
URL: https://github.com/apache/lucene/pull/14267#issuecomment-2759877138

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Disable the query cache by default. [lucene]

2025-03-27 Thread via GitHub


github-actions[bot] commented on PR #14187:
URL: https://github.com/apache/lucene/pull/14187#issuecomment-2759877242

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] PointInSetQuery clips segments by lower and upper [lucene]

2025-03-27 Thread via GitHub


github-actions[bot] commented on PR #14268:
URL: https://github.com/apache/lucene/pull/14268#issuecomment-2759877100

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Terminate automaton when it can match all suffixes, and match suffixes directly. [lucene]

2025-03-27 Thread via GitHub


github-actions[bot] commented on PR #13072:
URL: https://github.com/apache/lucene/pull/13072#issuecomment-2759878644

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org