[PR] Support modifying segmentInfos.counter in IndexWriter [lucene]
guojialiang92 opened a new pull request, #14417: URL: https://github.com/apache/lucene/pull/14417

### Description

This PR aims to address issue [14362](https://github.com/apache/lucene/issues/14362), which includes a discussion of the benefits of this modification.

### Tests

I added the test `TestIndexWriter#testAdvanceSegmentInfosCounter`.

### Checklist

- [x] I have reviewed the guidelines for [How to Contribute](https://github.com/apache/lucene/blob/main/CONTRIBUTING.md) and my code conforms to the standards described there to the best of my ability.
- [x] I have given Lucene maintainers [access](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the main branch.
- [x] I have run `./gradlew check`.
- [x] I have added tests for my changes.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Use @snippet javadoc tag for snippets [lucene]
dweiss commented on issue #14257: URL: https://github.com/apache/lucene/issues/14257#issuecomment-2756943569

It'll work with those indented lines as well - it actually will align block indentation to the column they should be starting at. So:

```
/// [Collector] for cutter+recorder pair. NOCOMMIT: This is a long line comment that should not be broken into smaller chunks.
///    * also, the space after /// should be preserved
///    * like here.
public FacetFieldCollector(FacetCutter facetCutter, FacetRecorder facetRecorder) {
  this.facetCutter = facetCutter;
  this.facetRecorder = facetRecorder;
}
```

ends up like this after running tidy:

```
/// [Collector] for cutter+recorder pair. NOCOMMIT: This is a long line comment that should not be broken into smaller chunks.
///    * also, the space after /// should be preserved
///    * like here.
public FacetFieldCollector(FacetCutter facetCutter, FacetRecorder facetRecorder) {
  this.facetCutter = facetCutter;
  this.facetRecorder = facetRecorder;
}
```
Re: [I] Use @snippet javadoc tag for snippets [lucene]
dweiss commented on issue #14257: URL: https://github.com/apache/lucene/issues/14257#issuecomment-2756947753

[here's the gjf issue I asked for the direction they'd like to follow](https://github.com/google/google-java-format/issues/1193)
Re: [PR] Speedup merging of HNSW graphs [lucene]
mayya-sharipova commented on code in PR #14331: URL: https://github.com/apache/lucene/pull/14331#discussion_r2005461835

lucene/core/src/java/org/apache/lucene/util/hnsw/ConcurrentHnswMerger.java:

```diff
@@ -51,19 +57,85 @@ protected HnswBuilder createBuilder(KnnVectorValues mergedVectorValues, int maxO
     OnHeapHnswGraph graph;
     BitSet initializedNodes = null;
-    if (initReader == null) {
+    if (graphReaders.size() == 0) {
       graph = new OnHeapHnswGraph(M, maxOrd);
     } else {
+      graphReaders.sort(Comparator.comparingInt(GraphReader::graphSize).reversed());
+      GraphReader initGraphReader = graphReaders.get(0);
+      KnnVectorsReader initReader = initGraphReader.reader();
+      MergeState.DocMap initDocMap = initGraphReader.initDocMap();
+      int initGraphSize = initGraphReader.graphSize();
       HnswGraph initializerGraph = ((HnswGraphProvider) initReader).getGraph(fieldInfo.name);
+
       if (initializerGraph.size() == 0) {
         graph = new OnHeapHnswGraph(M, maxOrd);
       } else {
         initializedNodes = new FixedBitSet(maxOrd);
-        int[] oldToNewOrdinalMap = getNewOrdMapping(mergedVectorValues, initializedNodes);
+        int[] oldToNewOrdinalMap =
+            getNewOrdMapping(
+                fieldInfo,
+                initReader,
+                initDocMap,
+                initGraphSize,
+                mergedVectorValues,
+                initializedNodes);
         graph = InitializedHnswGraphBuilder.initGraph(initializerGraph, oldToNewOrdinalMap, maxOrd);
       }
     }
     return new HnswConcurrentMergeBuilder(
         taskExecutor, numWorker, scorerSupplier, beamWidth, graph, initializedNodes);
   }
+
+  /**
+   * Creates a new mapping from old ordinals to new ordinals and returns the total number of vectors
+   * in the newly merged segment.
+   *
+   * @param mergedVectorValues vector values in the merged segment
+   * @param initializedNodes track what nodes have been initialized
+   * @return the mapping from old ordinals to new ordinals
+   * @throws IOException If an error occurs while reading from the merge state
+   */
+  private static final int[] getNewOrdMapping(
```

Review Comment:
Addressed in cb852a6387a09ba43049b8a24f1e026c309b368b
Re: [PR] Revert "gh-12627: HnswGraphBuilder connects disconnected HNSW graph components (#13566)" [lucene]
benwtrent commented on PR #14411: URL: https://github.com/apache/lucene/pull/14411#issuecomment-2757791544

@txwei have you seen this behavior in production? I am wondering about the urgency.
Re: [I] HNSW connect components can take an inordinate amount of time [lucene]
benwtrent commented on issue #14214: URL: https://github.com/apache/lucene/issues/14214#issuecomment-2757791987

> Can we expose a graph construction parameter in Lucene99HnswVectorsFormat to gate the connectComponents() call? This would allow us to mitigate this issue while a more comprehensive fix is developed.

I would argue not. We do not want more knobs to tune. We should instead fix the diversity & connection checks to be dynamic (e.g. fill connections when the distribution of scores is poor, etc.).

As for reverting the change, I am on the fence about it. Maybe that is the right thing.
Re: [I] Use @snippet javadoc tag for snippets [lucene]
dweiss commented on issue #14257: URL: https://github.com/apache/lucene/issues/14257#issuecomment-2757854863

Let's wait a few days and see if there's any feedback. Like I mentioned above, what we use for formatting/checking format adherence shouldn't really matter to anybody who uses `gradlew tidy` in Lucene - gjf is ASL-licensed software, and if we find anything problematic, I'll just publish my own fork and we can switch to that (and then we have the freedom of doing whatever, including applying markdown formatting of our own... like https://github.com/commonmark/commonmark-java).
Re: [I] Use @snippet javadoc tag for snippets [lucene]
rmuir commented on issue #14257: URL: https://github.com/apache/lucene/issues/14257#issuecomment-2757832416

awesome! I really hope the patch is accepted: this will definitely give us a path forward.
Re: [I] Use @snippet javadoc tag for snippets [lucene]
rmuir commented on issue #14257: URL: https://github.com/apache/lucene/issues/14257#issuecomment-2757929583

yeah: only thing I will say about choice of formatter is that normally I try to configure the editor in the repo to match what the build expects. e.g. for languages like python and typescript, it means disabling all formatting-related diagnostics and enabling "format-on-save" and "organize-imports-on-save", so that things "just work". It makes for easy dev experience where users have less friction with CI. But with java, such a setup seems difficult to achieve: I'm not sure anyone even has that today.
Re: [PR] #14410 - Add Anytime Ranking Searching - SLA-constrained ranking With Range Boosting and Dynamic SLA [lucene]
atris commented on PR #14409: URL: https://github.com/apache/lucene/pull/14409#issuecomment-2758008422

My bad - I should have set some more context. This is in reference to https://github.com/apache/lucene/issues/13675: the paper referred to by Adrien has a component on anytime search - SLA-based termination coupled with optional docID-range-based boosting. This PR is an implementation of the same.
Re: [PR] #14410 - Add Anytime Ranking Searching - SLA-constrained ranking With Range Boosting and Dynamic SLA [lucene]
benwtrent commented on PR #14409: URL: https://github.com/apache/lucene/pull/14409#issuecomment-2758025254

@atris Ah, thank you, I will take a look at the paper first. Do you have any benchmarking to replicate the paper's findings within the Apache Lucene context?
Re: [PR] #14410 - Add Anytime Ranking Searching - SLA-constrained ranking With Range Boosting and Dynamic SLA [lucene]
benwtrent commented on PR #14409: URL: https://github.com/apache/lucene/pull/14409#issuecomment-2757970091

@atris I am gonna be frank, I haven't a clue what this is doing :D. Why is this being added? What is it supposed to accomplish? Maybe there is context I am missing...
Re: [PR] skip keyword in German Normalization Filter [lucene]
rmuir commented on PR #14416: URL: https://github.com/apache/lucene/pull/14416#issuecomment-2757816185

i think it will introduce a ton more complexity: that's why we've pushed back on doing this for anything that isn't a stemmer. otherwise people will want LowerCaseFilter to respect it too... and then indexing performance is dead. If you want to change the behavior of GermanAnalyzer then just make a CustomAnalyzer with ProtectedTermsFilter.
Re: [PR] Add Issue Tracker Link under 'Editing Content on the Lucene™ Sites' [lucene-site]
dweiss commented on code in PR #78: URL: https://github.com/apache/lucene-site/pull/78#discussion_r2003502701

content/pages/site-instructions.md:

```diff
@@ -3,8 +3,10 @@
 URL: site-instructions.html
 save_as: site-instructions.html
 template: lucene/tlp/page
+
 ## Editing Content on the Lucene™ sites
+The web site is hosted in its own git repository **lucene-site** (see [Github](https://github.com/apache/lucene-site) and [Gitbox](https://gitbox.apache.org/repos/asf?p=lucene-site.git)).
-The web site is hosted in its own git repository `lucene-site` (see [Github](https://github.com/apache/lucene-site/) and [Gitbox](https://gitbox.apache.org/repos/asf/lucene-site.git)).
+Pushing to the `main` branch will update the staging site while pushing to `production` branch will update the main web site. Read the `README.md` file for further instructions.
-Pushing to the `main` branch will update the [staging site](https://lucene.staged.apache.org) while pushing to `production` branch will update the main web site. Read the [README.md](https://github.com/apache/lucene-site/blob/main/README.md) file for further instructions.
+For reporting website-related issues or suggesting improvements, please visit our [Issue Tracker](https://issues.apache.org/jira/projects/LUCENE).
```

Review Comment:
The issue tracker is no longer Jira - it should be github issues on this project (lucene-site), perhaps?
Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]
jpountz commented on code in PR #14273: URL: https://github.com/apache/lucene/pull/14273#discussion_r2017546602

lucene/core/src/java/org/apache/lucene/search/DISIDocIdStream.java:

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.lucene.search;

import java.io.IOException;
import org.apache.lucene.util.FixedBitSet;

final class DISIDocIdStream extends DocIdStream {

  private final DocIdSetIterator iterator;
  private final int to;
  private final FixedBitSet spare;

  DISIDocIdStream(DocIdSetIterator iterator, int to, FixedBitSet spare) {
```

Review Comment:
I'm leaning towards waiting until there's a need for it before adding it.
Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]
jpountz commented on code in PR #14273: URL: https://github.com/apache/lucene/pull/14273#discussion_r2017552582

lucene/core/src/java/org/apache/lucene/search/BooleanScorer.java:

```diff
@@ -207,8 +164,32 @@ private void scoreWindowIntoBitSetAndReplay(
       acceptDocs.applyMask(matching, base);
     }
-    docIdStreamView.base = base;
-    collector.collect(docIdStreamView);
+    if (buckets == null) {
+      if (acceptDocs != null) {
+        // In this case, live docs have not been applied yet.
+        acceptDocs.applyMask(matching, base);
+      }
```

Review Comment:
Thanks for catching, this bit of code was mistakenly duplicated.
Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]
jpountz commented on code in PR #14273: URL: https://github.com/apache/lucene/pull/14273#discussion_r2017556183

lucene/core/src/java/org/apache/lucene/search/BitSetDocIdStream.java:

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.lucene.search;

import java.io.IOException;
import org.apache.lucene.util.FixedBitSet;

final class BitSetDocIdStream extends DocIdStream {

  private final FixedBitSet bitSet;
  private final int offset, max;
  private int upTo;

  BitSetDocIdStream(FixedBitSet bitSet, int offset) {
    this.bitSet = bitSet;
    this.offset = offset;
    upTo = offset;
    max = (int) Math.min(Integer.MAX_VALUE, (long) offset + bitSet.length());
  }

  @Override
  public boolean mayHaveRemaining() {
    return upTo < max;
  }

  @Override
  public void forEach(int upTo, CheckedIntConsumer<IOException> consumer) throws IOException {
    if (upTo >= this.upTo) {
```

Review Comment:
(For reference, it's not as much for short-circuiting as for the fact that the logic under the if block would fail if trying to move backwards. But since we're doing a check anyway, I agree it's also worth excluding the trivial case when the range of docs to collect is empty.)
Re: [PR] Preparing existing profiler for adding concurrent profiling [lucene]
msfroh commented on PR #14413: URL: https://github.com/apache/lucene/pull/14413#issuecomment-2759357863

Does it make sense to create a separate `QueryProfilerBreakDown` per leaf? Or should it create one per slice? Can this be implemented as part of the `CollectorManager#newCollector` logic?
Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]
jpountz commented on code in PR #14273: URL: https://github.com/apache/lucene/pull/14273#discussion_r2017544601

lucene/core/src/java/org/apache/lucene/search/DocIdStream.java:

```diff
@@ -34,12 +33,34 @@ protected DocIdStream() {}
    * Iterate over doc IDs contained in this stream in order, calling the given {@link
    * CheckedIntConsumer} on them. This is a terminal operation.
    */
-  public abstract void forEach(CheckedIntConsumer<IOException> consumer) throws IOException;
+  public void forEach(CheckedIntConsumer<IOException> consumer) throws IOException {
+    forEach(DocIdSetIterator.NO_MORE_DOCS, consumer);
+  }
+
+  /**
+   * Iterate over doc IDs contained in this doc ID stream up to the given {@code upTo} exclusive,
+   * calling the given {@link CheckedIntConsumer} on them. It is not possible to iterate these doc
+   * IDs again later on.
+   */
+  public abstract void forEach(int upTo, CheckedIntConsumer<IOException> consumer)
+      throws IOException;

   /** Count the number of entries in this stream. This is a terminal operation. */
   public int count() throws IOException {
-    int[] count = new int[1];
-    forEach(_ -> count[0]++);
-    return count[0];
+    return count(DocIdSetIterator.NO_MORE_DOCS);
   }
+
+  /**
+   * Count the number of doc IDs in this stream that are below the given {@code upTo}. These doc IDs
+   * may not be consumed again later.
+   */
+  // Note: it's abstract rather than having a default impl that delegates to #forEach because doing
```

Review Comment:
Ah right, it was based on your previous feedback. :)
Re: [PR] Allow skip cache factor to be updated dynamically [lucene]
sgup432 commented on code in PR #14412: URL: https://github.com/apache/lucene/pull/14412#discussion_r2017693759

lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java:

```diff
@@ -122,12 +123,30 @@ public LRUQueryCache(
       long maxRamBytesUsed,
       Predicate<LeafReaderContext> leavesToCache,
       float skipCacheFactor) {
+    this(maxSize, maxRamBytesUsed, leavesToCache, new AtomicReference<>(skipCacheFactor));
+  }
+
+  /**
+   * Additionally, allows the ability to pass skipCacheFactor in form of AtomicReference where the
+   * caller can dynamically update (in a thread safe way) its value by calling skipCacheFactor.set()
+   * on their end.
+   */
+  public LRUQueryCache(
+      int maxSize,
+      long maxRamBytesUsed,
+      Predicate<LeafReaderContext> leavesToCache,
+      AtomicReference<Float> skipCacheFactor) {
```

Review Comment:
I actually made it an AtomicReference so that a caller can create it on their end and pass the reference here, via which the caller can change its value dynamically. The other way to set skipCacheFactor dynamically would have been a getter/setter like you mentioned below. But then one cannot use the interface `QueryCache` to initialize `LRUQueryCache`, and they are forced to use the concrete implementation (to get/set), which I think is not right.

```java
QueryCache lruCache = new LRUQueryCache();
// Cannot access getter/setter unless I typecast or use the concrete implementation directly.
```

Let me know what you think.
Re: [PR] A specialized Trie for Block Tree Index [lucene]
gf2121 commented on code in PR #14333: URL: https://github.com/apache/lucene/pull/14333#discussion_r2006889147

lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/TrieBuilder.java:

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.lucene.codecs.lucene90.blocktree;

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;
import java.util.function.BiConsumer;
import org.apache.lucene.store.DataOutput;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.store.RandomAccessInput;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.BytesRefBuilder;

/** TODO make it a more memory efficient structure */
class TrieBuilder {

  static final int SIGN_NO_CHILDREN = 0x00;
  static final int SIGN_SINGLE_CHILD_WITH_OUTPUT = 0x01;
  static final int SIGN_SINGLE_CHILD_WITHOUT_OUTPUT = 0x02;
  static final int SIGN_MULTI_CHILDREN = 0x03;

  static final int LEAF_NODE_HAS_TERMS = 1 << 5;
  static final int LEAF_NODE_HAS_FLOOR = 1 << 6;
  static final long NON_LEAF_NODE_HAS_TERMS = 1L << 1;
  static final long NON_LEAF_NODE_HAS_FLOOR = 1L << 0;

  /**
   * The output describing the term block the prefix point to.
   *
   * @param fp describes the on-disk terms block which a trie node points to.
   * @param hasTerms A boolean which will be false if this on-disk block consists entirely of
   *     pointers to child blocks.
   * @param floorData A {@link BytesRef} which will be non-null when a large block of terms sharing
   *     a single trie prefix is split into multiple on-disk blocks.
   */
  record Output(long fp, boolean hasTerms, BytesRef floorData) {}

  private enum Status {
    BUILDING,
    SAVED,
    DESTROYED
  }

  private static class Node {

    // The utf8 digit that leads to this Node, 0 for root node
    private final int label;
    // The children listed in order by their utf8 label
    private final LinkedList<Node> children;
    // The output of this node.
    private Output output;

    // Vars used during saving:

    // The file pointer point to where the node saved. -1 means the node has not been saved.
    private long fp = -1;
    // The iterator whose next() point to the first child has not been saved.
    private Iterator<Node> childrenIterator;

    Node(int label, Output output, LinkedList<Node> children) {
      this.label = label;
      this.output = output;
      this.children = children;
    }
  }

  private Status status = Status.BUILDING;
  final Node root = new Node(0, null, new LinkedList<>());

  static TrieBuilder bytesRefToTrie(BytesRef k, Output v) {
    return new TrieBuilder(k, v);
  }

  private TrieBuilder(BytesRef k, Output v) {
    if (k.length == 0) {
      root.output = v;
      return;
    }
    Node parent = root;
    for (int i = 0; i < k.length; i++) {
      int b = k.bytes[i + k.offset] & 0xFF;
      Output output = i == k.length - 1 ? v : null;
      Node node = new Node(b, output, new LinkedList<>());
      parent.children.add(node);
      parent = node;
    }
  }

  /**
   * Absorb all (K, V) pairs from the given trie into this one. The given trie builder should not
   * have key that already exists in this one, otherwise a {@link IllegalArgumentException} will be
   * thrown and this trie will get destroyed.
   *
   * <p>Note: the given trie will be destroyed after absorbing.
   */
  void absorb(TrieBuilder trieBuilder) {
    if (status != Status.BUILDING || trieBuilder.status != Status.BUILDING) {
      throw new IllegalStateException("tries should be unsaved");
    }
    // Use a simple stack to avoid recursion.
    Deque<Runnable> stack = new ArrayDeque<>();
    stack.add(() -> absorb(this.root, trieBuilder.root, stack));
    while (!stack.isEmpty()) {
      stack.pop().run();
    }
    trieBuilder.status = Status.DESTROYED;
  }

  private void absorb(Node n, Node add, Deque<Runnable> stack) {
    assert n.label == add.label;
    if (add.output != null) {
      if (n.output != null) {
```
Re: [PR] Enable collectors to take advantage of pre-aggregated data. [lucene]
jpountz commented on PR #14401: URL: https://github.com/apache/lucene/pull/14401#issuecomment-2759494063 > This would also benefit https://github.com/apache/lucene/pull/14273 I don't think so, or rather taking advantage of range collection shouldn't help more than what #14273 does with `RangeDocIdStream`? For clarity, this `collectRange` method is more useful for aggregating values (facet "recorders" if I use the terminology of the new faceting framework?). Implementations would need to consult the `DocValuesSkipper` to check if it has pre-aggregated data over ranges of doc IDs that are contained in the range of doc IDs passed to `collectRange`. These sub ranges could be aggregated in constant time, without having to iterate over docs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
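To make the idea in this comment concrete, here is a minimal standalone sketch (not the actual Lucene `DocValuesSkipper`/`collectRange` API; class and method names are illustrative) of aggregating a doc-ID range by combining pre-aggregated per-block sums with per-doc iteration only at the edges of the range:

```java
// Sketch, assuming per-block pre-aggregated sums like a skipper could expose:
// collectRange() aggregates fully covered blocks in constant time and only
// iterates docs in the partial blocks at either end of the range.
class PreAggregatedHistogram {
  private final long[] values; // per-doc values
  private final int blockSize; // docs per pre-aggregated block
  private final long[] blockSums; // sum of values for each block

  PreAggregatedHistogram(long[] values, int blockSize) {
    this.values = values;
    this.blockSize = blockSize;
    this.blockSums = new long[(values.length + blockSize - 1) / blockSize];
    for (int i = 0; i < values.length; i++) {
      blockSums[i / blockSize] += values[i];
    }
  }

  /** Sum of values for docs in [minDoc, maxDoc), exclusive upper bound. */
  long collectRange(int minDoc, int maxDoc) {
    long sum = 0;
    int doc = minDoc;
    // Leading partial block: iterate doc by doc until aligned with a block start.
    while (doc < maxDoc && doc % blockSize != 0) {
      sum += values[doc++];
    }
    // Fully covered blocks: use the pre-aggregated sums, no per-doc work.
    while (doc + blockSize <= maxDoc) {
      sum += blockSums[doc / blockSize];
      doc += blockSize;
    }
    // Trailing partial block.
    while (doc < maxDoc) {
      sum += values[doc++];
    }
    return sum;
  }
}
```

The per-doc loops at the edges are what a real implementation would replace with doc-values iteration; the middle loop is the constant-time part the comment describes.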
Re: [PR] Allow skip cache factor to be updated dynamically [lucene]
jpountz commented on code in PR #14412: URL: https://github.com/apache/lucene/pull/14412#discussion_r2017616847 ## lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java: ## @@ -99,7 +100,7 @@ public class LRUQueryCache implements QueryCache, Accountable { private final Map<IndexReader.CacheKey, LeafCache> cache; private final ReentrantReadWriteLock.ReadLock readLock; private final ReentrantReadWriteLock.WriteLock writeLock; - private final float skipCacheFactor; + private final AtomicReference<Float> skipCacheFactor; Review Comment: Making it volatile should be enough? ## lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java: ## @@ -122,12 +123,30 @@ public LRUQueryCache( long maxRamBytesUsed, Predicate<LeafReaderContext> leavesToCache, float skipCacheFactor) { +this(maxSize, maxRamBytesUsed, leavesToCache, new AtomicReference<>(skipCacheFactor)); + } + + /** + * Additionally, allows the ability to pass skipCacheFactor in form of AtomicReference where the + * caller can dynamically update(in a thread safe way) its value by calling skipCacheFactor.set() + * on their end. + */ + public LRUQueryCache( + int maxSize, + long maxRamBytesUsed, + Predicate<LeafReaderContext> leavesToCache, + AtomicReference<Float> skipCacheFactor) { Review Comment: Let's avoid exposing the implementation detail of how the skip cache factor is stored internally, ie. skipCacheFactor should be a regular `float` here? ## lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java: ## @@ -142,6 +161,10 @@ public LRUQueryCache( missCount = new LongAdder(); } + AtomicReference<Float> getSkipCacheFactor() { +return skipCacheFactor; + } Review Comment: Can you add a regular getter/setter instead?
## lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java: ## @@ -269,8 +292,8 @@ boolean requiresEviction() { } CacheAndCount get(Query key, IndexReader.CacheHelper cacheHelper) { -assert key instanceof BoostQuery == false; -assert key instanceof ConstantScoreQuery == false; +assert !(key instanceof BoostQuery); +assert !(key instanceof ConstantScoreQuery); Review Comment: Please undo these changes, these `== false` are on purpose (you'll find many of them across the code base) to be more readable.
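The reviewer's suggestion above (a `volatile` float behind a plain getter/setter instead of an `AtomicReference` in the public constructor) can be sketched as follows; the class and method names are illustrative, not the actual `LRUQueryCache` API:

```java
// Sketch of the suggested design: a volatile field gives safe publication
// (a write by one thread is visible to all readers without locking), which
// is all a single float needs -- no AtomicReference in the public API.
class SkipCacheFactorHolder {
  private volatile float skipCacheFactor;

  SkipCacheFactorHolder(float skipCacheFactor) {
    this.skipCacheFactor = skipCacheFactor;
  }

  float getSkipCacheFactor() {
    return skipCacheFactor;
  }

  // Callers can update the value dynamically; readers see it on their next read.
  void setSkipCacheFactor(float skipCacheFactor) {
    this.skipCacheFactor = skipCacheFactor;
  }
}
```

`AtomicReference` would only be needed for compound read-modify-write operations (e.g. compare-and-set); a plain overwrite of a single value is safe with `volatile` alone.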
Re: [I] Leverage sparse doc value indexes for range and value facet collection [lucene]
jpountz commented on issue #14406: URL: https://github.com/apache/lucene/issues/14406#issuecomment-2759414888 Thanks for the explanation, that makes sense.
Re: [PR] Revert "gh-12627: HnswGraphBuilder connects disconnected HNSW graph components (#13566)" [lucene]
txwei commented on PR #14411: URL: https://github.com/apache/lucene/pull/14411#issuecomment-2759555261 @benwtrent we haven't released this to prod yet. We spotted a severe perf regression with the degenerate case of all vectors being the same. Since we have enough customers using all kinds of embeddings in production, this is very likely to be an issue.
Re: [PR] Preparing existing profiler for adding concurrent profiling [lucene]
jpountz commented on PR #14413: URL: https://github.com/apache/lucene/pull/14413#issuecomment-2759377415 @jainankitk OK. In my opinion, it's more important to handle the concurrent and non-concurrent cases consistently than to save some overhead when searches are not concurrent. I'd really like non-concurrent search to look and feel like a concurrent search with a single slice running on a SameThreadExecutorService as far as profiling is concerned. So I wouldn't specialize the class hierarchy for concurrent vs. non-concurrent.
Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]
gsmiller commented on code in PR #14273: URL: https://github.com/apache/lucene/pull/14273#discussion_r2017721363 ## lucene/core/src/java/org/apache/lucene/util/FixedBitSet.java: ## @@ -204,6 +205,40 @@ public int cardinality() { return Math.toIntExact(tot); } + /** + * Return the number of set bits between indexes {@code from} inclusive and {@code to} exclusive. + */ + public int cardinality(int from, int to) { +Objects.checkFromToIndex(from, to, length()); + +int cardinality = 0; + +// First, align `from` with a word start, ie. a multiple of Long.SIZE (64) +if ((from & 0x3F) != 0) { + long bits = this.bits[from >> 6] >>> from; + int numBitsTilNextWord = -from & 0x3F; + if (to - from < numBitsTilNextWord) { +bits &= (1L << (to - from)) - 1L; +return Long.bitCount(bits); + } + cardinality += Long.bitCount(bits); + from += numBitsTilNextWord; + assert (from & 0x3F) == 0; +} + +for (int i = from >> 6, end = to >> 6; i < end; ++i) { + cardinality += Long.bitCount(bits[i]); +} + +// Now handle bits between the last complete word and to +if ((to & 0x3F) != 0) { + long bits = this.bits[to >> 6] << -to; Review Comment: minor: one other small comment. I noticed in the new `forEach` method you added you use `long bits = this.bits[to >> 6] & ((1L << to) - 1);`. Is the rationale here that you need to persist the correct number of trailing zeros in the `forEach` implementation but not in this case since you're doing a bit count? Is this approach (shifting by `-to`) more performant (I ask since you could use the same approach as `forEach` here too for consistency, so I assume you had a reason ;))
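The two idioms discussed in this review are equivalent for bit counting, which a small standalone demonstration makes clear. Java masks a `long` shift count to its low 6 bits, so for `t = to & 0x3F` (with `t != 0`), `v << -to` shifts left by `64 - t` and discards everything but the low `t` bits; their population count matches masking with `(1L << to) - 1`, which keeps those same low `t` bits in place (the form `forEach` needs, because it must preserve actual bit positions, not just the count):

```java
// Demonstrates the shift-vs-mask equivalence from the review above.
class ShiftVsMask {
  // Counts set bits below position (to & 0x3F) by shifting them to the top
  // of the word; valid when (to & 0x3F) != 0, as in the cardinality() code.
  static int countByShift(long v, int to) {
    return Long.bitCount(v << -to); // shift count is taken mod 64 in Java
  }

  // Counts the same bits by masking them off in place, as forEach does.
  static int countByMask(long v, int to) {
    return Long.bitCount(v & ((1L << to) - 1)); // shift count also mod 64
  }
}
```

Both forms compile to a shift (or shift plus mask) and a `popcnt`; any performance difference is marginal, which is likely why the reviewer frames it as a consistency question.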
Re: [PR] Avoid reload block when seeking backward in SegmentTermsEnum. [lucene]
github-actions[bot] commented on PR #13253: URL: https://github.com/apache/lucene/pull/13253#issuecomment-2759878401 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]
gsmiller commented on code in PR #14273: URL: https://github.com/apache/lucene/pull/14273#discussion_r2017767190 ## lucene/core/src/java/org/apache/lucene/search/DISIDocIdStream.java: ## @@ -0,0 +1,68 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search; + +import java.io.IOException; +import org.apache.lucene.util.FixedBitSet; + +final class DISIDocIdStream extends DocIdStream { + + private final DocIdSetIterator iterator; + private final int to; + private final FixedBitSet spare; + + DISIDocIdStream(DocIdSetIterator iterator, int to, FixedBitSet spare) { Review Comment: I think that's fair. Thanks!
Re: [PR] Support modifying segmentInfos.counter in IndexWriter [lucene]
guojialiang92 commented on code in PR #14417: URL: https://github.com/apache/lucene/pull/14417#discussion_r2015958210 ## lucene/core/src/test/org/apache/lucene/index/TestIndexWriter.java: ## @@ -5037,4 +5037,47 @@ public void testDocValuesSkippingIndexWithoutDocValues() throws Exception { } } } + + public void testAdvanceSegmentInfosCounter() throws IOException { +Directory dir = newDirectory(); + +IndexWriter writer; +IndexReader reader; + +writer = new IndexWriter(dir, newIndexWriterConfig(new MockAnalyzer(random()))); + +// add 10 documents +for (int i = 0; i < 10; i++) { + addDocWithIndex(writer, i); + writer.commit(); +} +long beforeAdvanceSegmentCounter = writer.getSegmentInfosCounter(); +writer.advanceSegmentInfosCounter(1); +assertEquals(beforeAdvanceSegmentCounter, writer.getSegmentInfosCounter()); Review Comment: I have modified.
Re: [PR] Support modifying segmentInfos.counter in IndexWriter [lucene]
guojialiang92 commented on code in PR #14417: URL: https://github.com/apache/lucene/pull/14417#discussion_r2015951779 ## lucene/core/src/java/org/apache/lucene/index/IndexWriter.java: ## @@ -1427,6 +1427,24 @@ public synchronized void advanceSegmentInfosVersion(long newVersion) { changed(); } + /** + * If {@link SegmentInfos#counter} is below {@code newCounter} then update it to this value. + * + * @lucene.internal + */ + public synchronized void advanceSegmentInfosCounter(long newCounter) { +this.ensureOpen(); +if (segmentInfos.counter < newCounter) { + segmentInfos.counter = newCounter; +} + } + + /** Returns the {@link SegmentInfos#counter}. */ + public synchronized long getSegmentInfosCounter() { Review Comment: I have removed the synchronization to avoid affecting performance when it is heavily called in some scenarios.
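The advance-if-below pattern being reviewed here, plus the concern about an unsynchronized getter, can be illustrated with a small standalone sketch (not the actual `IndexWriter` code; the class name is hypothetical). Writes stay under a lock so the read-compare-write is atomic, while marking the field `volatile` lets an unsynchronized getter still observe the latest value:

```java
// Sketch of a monotonically advancing counter: advance() only ever moves
// the value forward, and the volatile field makes the lock-free getter safe.
class MonotonicCounter {
  private volatile long counter;

  synchronized void advance(long newCounter) {
    // A stale or smaller value is ignored, so concurrent callers
    // can never move the counter backwards.
    if (counter < newCounter) {
      counter = newCounter;
    }
  }

  long get() { // deliberately unsynchronized, like the getter in the PR
    return counter;
  }
}
```

Without `volatile` (or synchronization) on the read side, the Java memory model would not guarantee that a plain getter sees the most recent advance from another thread.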
Re: [PR] Support modifying segmentInfos.counter in IndexWriter [lucene]
guojialiang92 commented on code in PR #14417: URL: https://github.com/apache/lucene/pull/14417#discussion_r2015954336 ## lucene/core/src/java/org/apache/lucene/index/IndexWriter.java: ## @@ -1427,6 +1427,24 @@ public synchronized void advanceSegmentInfosVersion(long newVersion) { changed(); } + /** + * If {@link SegmentInfos#counter} is below {@code newCounter} then update it to this value. + * + * @lucene.internal + */ + public synchronized void advanceSegmentInfosCounter(long newCounter) { Review Comment: From the `IndexWriter#changed` interface definition, it should indeed be called.
Re: [PR] Support modifying segmentInfos.counter in IndexWriter [lucene]
guojialiang92 commented on code in PR #14417: URL: https://github.com/apache/lucene/pull/14417#discussion_r2015957918 ## lucene/core/src/test/org/apache/lucene/index/TestIndexWriter.java: ## @@ -5037,4 +5037,47 @@ public void testDocValuesSkippingIndexWithoutDocValues() throws Exception { } } } + + public void testAdvanceSegmentInfosCounter() throws IOException { +Directory dir = newDirectory(); + +IndexWriter writer; +IndexReader reader; + +writer = new IndexWriter(dir, newIndexWriterConfig(new MockAnalyzer(random()))); + +// add 10 documents +for (int i = 0; i < 10; i++) { + addDocWithIndex(writer, i); + writer.commit(); +} +long beforeAdvanceSegmentCounter = writer.getSegmentInfosCounter(); +writer.advanceSegmentInfosCounter(1); +assertEquals(beforeAdvanceSegmentCounter, writer.getSegmentInfosCounter()); + +writer.advanceSegmentInfosCounter(1000); +assertEquals(1000, writer.getSegmentInfosCounter()); Review Comment: I have modified.
Re: [PR] #14410 - Add Anytime Ranking Searching - SLA-constrained ranking With Range Boosting and Dynamic SLA [lucene]
atris commented on PR #14409: URL: https://github.com/apache/lucene/pull/14409#issuecomment-2756755022 @jpountz @benwtrent requesting your review please
Re: [PR] Support modifying segmentInfos.counter in IndexWriter [lucene]
guojialiang92 commented on PR #14417: URL: https://github.com/apache/lucene/pull/14417#issuecomment-2757291661 Thanks for helping with the code review; I have made modifications according to the suggestions.
Re: [PR] Support modifying segmentInfos.counter in IndexWriter [lucene]
guojialiang92 commented on code in PR #14417: URL: https://github.com/apache/lucene/pull/14417#discussion_r2015872376 ## lucene/core/src/test/org/apache/lucene/index/TestIndexWriter.java: ## @@ -5037,4 +5037,45 @@ public void testDocValuesSkippingIndexWithoutDocValues() throws Exception { } } } + + public void testAdvanceSegmentInfosCounter() throws IOException { +Directory dir = newDirectory(); + +IndexWriter writer; +IndexReader reader; + +writer = new IndexWriter(dir, newIndexWriterConfig(new MockAnalyzer(random()))); + +// add 10 documents +for (int i = 0; i < 10; i++) { + addDocWithIndex(writer, i); + writer.commit(); +} +long beforeAdvanceSegmentCounter = writer.getSegmentInfosCounter(); +writer.advanceSegmentInfosCounter(1); +assertEquals(beforeAdvanceSegmentCounter, writer.getSegmentInfosCounter()); + +writer.advanceSegmentInfosCounter(1000); +assertEquals(1000, writer.getSegmentInfosCounter()); + +// add 40 documents +for (int i = 10; i < 50; i++) { + addDocWithIndex(writer, i); + writer.commit(); +} + +assertEquals(1041, writer.getSegmentInfosCounter()); Review Comment: You're right, I have already made modifications.
Re: [PR] skip keyword in German Normalization Filter [lucene]
xzhang9292 commented on PR #14416: URL: https://github.com/apache/lucene/pull/14416#issuecomment-2756971573 > This keyword is legacy, for stemmers not normalizers. Just use ProtectedTermFilter which works with any tokenfilter without requiring modification to its code? @rmuir Thank you for your comment. Yes, using ProtectedTermFilter could be an option when directly using the German Normalization Filter. Our problem comes when we are using the German Analyzer. The GermanNormalizer is positioned before the GermanStemmer, so even if we put a word like "Bär" in the exclusion set, the normalizer can still change it to "bar". I think when people put a word in the exclusion set, they probably also want the normalizer to keep the word. Would it be beneficial to let the normalizer also exclude keywords?
Re: [I] Support modifying segmentInfos.counter in IndexWriter [lucene]
vigyasharma commented on issue #14362: URL: https://github.com/apache/lucene/issues/14362#issuecomment-2756963298 Sounds good to me. Will watch out for your PR.
Re: [PR] Support modifying segmentInfos.counter in IndexWriter [lucene]
vigyasharma commented on code in PR #14417: URL: https://github.com/apache/lucene/pull/14417#discussion_r2015817265 ## lucene/core/src/test/org/apache/lucene/index/TestIndexWriter.java: ## @@ -5037,4 +5037,45 @@ public void testDocValuesSkippingIndexWithoutDocValues() throws Exception { } } } + + public void testAdvanceSegmentInfosCounter() throws IOException { +Directory dir = newDirectory(); + +IndexWriter writer; +IndexReader reader; + +writer = new IndexWriter(dir, newIndexWriterConfig(new MockAnalyzer(random()))); + +// add 10 documents +for (int i = 0; i < 10; i++) { + addDocWithIndex(writer, i); + writer.commit(); +} +long beforeAdvanceSegmentCounter = writer.getSegmentInfosCounter(); +writer.advanceSegmentInfosCounter(1); +assertEquals(beforeAdvanceSegmentCounter, writer.getSegmentInfosCounter()); + +writer.advanceSegmentInfosCounter(1000); +assertEquals(1000, writer.getSegmentInfosCounter()); + +// add 40 documents +for (int i = 10; i < 50; i++) { + addDocWithIndex(writer, i); + writer.commit(); +} + +assertEquals(1041, writer.getSegmentInfosCounter()); Review Comment: I think it's sufficient to assert that the counter value is `>=1000` i.e. higher than what you advanced it to. `SegmentInfos.counter` is public and gets incremented each time a segment is named, including background merges. Asserting on an exact value can make the test flaky.
Re: [PR] Support modifying segmentInfos.counter in IndexWriter [lucene]
vigyasharma commented on code in PR #14417: URL: https://github.com/apache/lucene/pull/14417#discussion_r2015825615 ## lucene/core/src/java/org/apache/lucene/index/IndexWriter.java: ## @@ -1427,6 +1427,24 @@ public synchronized void advanceSegmentInfosVersion(long newVersion) { changed(); } + /** + * If {@link SegmentInfos#counter} is below {@code newCounter} then update it to this value. + * + * @lucene.internal + */ + public synchronized void advanceSegmentInfosCounter(long newCounter) { Review Comment: Do we need to call `changed()` or increment change count here? ## lucene/core/src/java/org/apache/lucene/index/IndexWriter.java: ## @@ -1427,6 +1427,24 @@ public synchronized void advanceSegmentInfosVersion(long newVersion) { changed(); } + /** + * If {@link SegmentInfos#counter} is below {@code newCounter} then update it to this value. + * + * @lucene.internal + */ + public synchronized void advanceSegmentInfosCounter(long newCounter) { +this.ensureOpen(); +if (segmentInfos.counter < newCounter) { + segmentInfos.counter = newCounter; +} + } + + /** Returns the {@link SegmentInfos#counter}. */ + public synchronized long getSegmentInfosCounter() { Review Comment: Do we really need this synchronized?
Re: [PR] Support modifying segmentInfos.counter in IndexWriter [lucene]
vigyasharma commented on code in PR #14417: URL: https://github.com/apache/lucene/pull/14417#discussion_r2015832195 ## lucene/core/src/test/org/apache/lucene/index/TestIndexWriter.java: ## @@ -5037,4 +5037,47 @@ public void testDocValuesSkippingIndexWithoutDocValues() throws Exception { } } } + + public void testAdvanceSegmentInfosCounter() throws IOException { +Directory dir = newDirectory(); + +IndexWriter writer; +IndexReader reader; + +writer = new IndexWriter(dir, newIndexWriterConfig(new MockAnalyzer(random()))); + +// add 10 documents +for (int i = 0; i < 10; i++) { + addDocWithIndex(writer, i); + writer.commit(); +} +long beforeAdvanceSegmentCounter = writer.getSegmentInfosCounter(); +writer.advanceSegmentInfosCounter(1); +assertEquals(beforeAdvanceSegmentCounter, writer.getSegmentInfosCounter()); + +writer.advanceSegmentInfosCounter(1000); +assertEquals(1000, writer.getSegmentInfosCounter()); Review Comment: Do we need this check? Let's assert with `gte` to avoid flakiness. ## lucene/core/src/test/org/apache/lucene/index/TestIndexWriter.java: ## @@ -5037,4 +5037,47 @@ public void testDocValuesSkippingIndexWithoutDocValues() throws Exception { } } } + + public void testAdvanceSegmentInfosCounter() throws IOException { +Directory dir = newDirectory(); + +IndexWriter writer; +IndexReader reader; + +writer = new IndexWriter(dir, newIndexWriterConfig(new MockAnalyzer(random()))); + +// add 10 documents +for (int i = 0; i < 10; i++) { + addDocWithIndex(writer, i); + writer.commit(); +} +long beforeAdvanceSegmentCounter = writer.getSegmentInfosCounter(); +writer.advanceSegmentInfosCounter(1); +assertEquals(beforeAdvanceSegmentCounter, writer.getSegmentInfosCounter()); Review Comment: Let's add a similar `gte` check here as in the assert below, instead of assuming that counter value is 1.
Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]
jpountz commented on PR #14273: URL: https://github.com/apache/lucene/pull/14273#issuecomment-2758283399 I played with the geonames dataset, by filtering out docs that don't have a value for the `elevation` field (2.3M docs left), enabling index sorting on the `elevation` field and computing histograms on the `elevation` field with a bucket width of 100.

| Query | | Latency on main (ms) | Latency on branch (ms) |
| - | - | - | - |
| `MatchAllDocsQuery` | Uses `RangeDocIdStream` under the hood | 6.9 | 4.3 |
| `featureClass:(S P)` | Matches spots or cities, 1.2M matching docs, uses `BitSetDocIdStream` under the hood | 4.8 | 2.4 |

I also checked wikibigall, no slowdowns were detected.
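As a concrete reference for the benchmark setup above, here is a minimal sketch of fixed-width histogram collection: each value lands in bucket `floorDiv(value, width)`, with the field name and the width of 100 taken from the comment; the collector itself is illustrative, not the PR's implementation. With index sorting on the field, consecutive docs tend to share buckets, which is what makes range-based collection pay off:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal fixed-width histogram: bucket key = floorDiv(value, bucketWidth).
class HistogramSketch {
  static Map<Long, Integer> collect(long[] values, long bucketWidth) {
    Map<Long, Integer> counts = new HashMap<>();
    for (long v : values) {
      // floorDiv handles negative elevations correctly, unlike plain v / width
      counts.merge(Math.floorDiv(v, bucketWidth), 1, Integer::sum);
    }
    return counts;
  }
}
```

For example, with a width of 100, elevations -50 and 10 land in buckets -1 and 0 respectively; integer division would wrongly put both in bucket 0.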
Re: [I] Use @snippet javadoc tag for snippets [lucene]
dweiss commented on issue #14257: URL: https://github.com/apache/lucene/issues/14257#issuecomment-2758276309 You're right - this may be a problem. I use intellij and even gjf there does not work exactly the same way (https://github.com/google/google-java-format/pull/1165). I stopped caring after a while. Just do whatever I need, then run a tidy before committing. I know - not ideal. But at least a workable solution.
Re: [PR] Make PointValues.intersect iterative instead of recursive [lucene]
iverase commented on code in PR #14391: URL: https://github.com/apache/lucene/pull/14391#discussion_r2009830096 ## lucene/core/src/java/org/apache/lucene/index/PointValues.java: ## @@ -351,35 +351,32 @@ public final void intersect(IntersectVisitor visitor) throws IOException { assert pointTree.moveToParent() == false; } - private void intersect(IntersectVisitor visitor, PointTree pointTree) throws IOException { -Relation r = visitor.compare(pointTree.getMinPackedValue(), pointTree.getMaxPackedValue()); -switch (r) { - case CELL_OUTSIDE_QUERY: -// This cell is fully outside the query shape: stop recursing -break; - case CELL_INSIDE_QUERY: + private static void intersect(IntersectVisitor visitor, PointTree pointTree) throws IOException { +int depth = 0; Review Comment: I don't think we need this variable, what do you think about this:
```
public final void intersect(IntersectVisitor visitor) throws IOException {
  final PointTree pointTree = getPointTree();
  while (true) {
    Relation compare =
        visitor.compare(pointTree.getMinPackedValue(), pointTree.getMaxPackedValue());
    if (compare == Relation.CELL_INSIDE_QUERY) {
      // This cell is fully inside the query shape: recursively add all points in this cell
      // without filtering
      pointTree.visitDocIDs(visitor);
    } else if (compare == Relation.CELL_CROSSES_QUERY) {
      // The cell crosses the shape boundary, or the cell fully contains the query, so we fall
      // through and do full filtering:
      if (pointTree.moveToChild()) {
        continue;
      }
      // TODO: we can assert that the first value here in fact matches what the pointTree
      // claimed?
      // Leaf node; scan and filter all points in this block:
      pointTree.visitDocValues(visitor);
    }
    while (pointTree.moveToSibling() == false) {
      if (pointTree.moveToParent() == false) {
        // we are at the root, finish.
        return;
      }
    }
  }
}
```
Re: [PR] #14410 - Add Anytime Ranking Searching - SLA-constrained ranking With Range Boosting and Dynamic SLA [lucene]
jpountz commented on PR #14409: URL: https://github.com/apache/lucene/pull/14409#issuecomment-2758339188

I don't think that the timeout support that you are introducing buys anything compared with the existing timeout support via `IndexSearcher#setTimeout` and `TimeLimitingBulkScorer`. To me the interesting part is the ability to estimate the maximum possible score for various ranges of the doc ID space (there is an assumption that something like recursive graph bisection is enabled at index time) so that they can be searched first in order to help dynamic pruning, but it looks like this bit is left to the user in your PR. It would be interesting to look into doing this in Lucene by looking at impact blocks (`Scorer#advanceShallow`, `Scorer#getMaxScore`) for the various disjunctive clauses.
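The "search promising ranges first" idea can be sketched independently of Lucene: given a per-range upper bound on score (the kind of information `Scorer#advanceShallow`/`Scorer#getMaxScore` could provide), visit ranges in descending bound order and stop once a range's bound cannot beat the best hit found so far. This is a minimal illustration under those assumptions; `Range` and `planTop1` are invented names, not part of any Lucene API:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RangeOrderingSketch {
  // A contiguous slice of the doc ID space plus an upper bound on the
  // score any document inside it can achieve.
  record Range(int fromDoc, int toDoc, float maxScoreBound) {}

  // Returns the ranges worth scoring for a top-1 search, in visit order:
  // best bound first, pruned once a bound cannot improve on the best
  // score already found.
  static List<Range> planTop1(List<Range> ranges, float bestScoreSoFar) {
    List<Range> sorted = new ArrayList<>(ranges);
    sorted.sort(Comparator.comparingDouble(Range::maxScoreBound).reversed());
    List<Range> plan = new ArrayList<>();
    for (Range r : sorted) {
      if (r.maxScoreBound() <= bestScoreSoFar) {
        break; // this range, and every later one, cannot produce a better hit
      }
      plan.add(r);
    }
    return plan;
  }

  public static void main(String[] args) {
    List<Range> ranges = List.of(
        new Range(0, 1000, 1.2f),
        new Range(1000, 2000, 3.4f),
        new Range(2000, 3000, 0.7f));
    // With a current best score of 1.0, only the two ranges whose bound
    // exceeds 1.0 need scoring, best bound first.
    System.out.println(planTop1(ranges, 1.0f));
  }
}
```

In a real searcher the "best score so far" would tighten as hits are collected, which is what makes visiting high-bound ranges first valuable: it raises the pruning threshold as early as possible. Document layouts like recursive graph bisection make the per-range bounds meaningfully different, which is the assumption mentioned above.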
Re: [PR] Preparing existing profiler for adding concurrent profiling [lucene]
jainankitk commented on PR #14413: URL: https://github.com/apache/lucene/pull/14413#issuecomment-2758741304

> Can you explain why we need two impls? I would have assumed that the ConcurrentQueryProfilerBreakdown could also be used for searches that are not concurrent?

`ConcurrentQueryProfilerBreakdown` maintains a separate instance of `QueryProfileBreakdown` for each segment. In the context method, `ConcurrentQueryProfilerBreakdown` returns the corresponding `QueryProfileBreakdown` object for each segment.

```java
@Override
public QueryProfilerBreakdown context(LeafReaderContext context) {
  return this;
}
```

Hence, I felt we don't need to take the overhead of creating a breakdown per segment and then merging them together for the response. That being said, eventually we can keep just `ConcurrentQueryProfilerBreakdown` if we prefer that for simplicity.
Re: [PR] Enable collectors to take advantage of pre-aggregated data. [lucene]
gsmiller commented on PR #14401: URL: https://github.com/apache/lucene/pull/14401#issuecomment-2758696779

It makes sense to me to expose the idea of doc range collection as a first-class API on leaf collectors for the reasons you outlined above. This would benefit #14273 as well, right? If there are scorers that can leverage the range collection call, it would immediately benefit, I believe.
Re: [PR] skip keyword in German Normalization Filter [lucene]
xzhang9292 closed pull request #14416: skip keyword in German Normalization Filter URL: https://github.com/apache/lucene/pull/14416
Re: [PR] Allow skip cache factor to be updated dynamically [lucene]
sgup432 commented on code in PR #14412: URL: https://github.com/apache/lucene/pull/14412#discussion_r2017693759

## lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java:

```diff
@@ -122,12 +123,30 @@ public LRUQueryCache(
       long maxRamBytesUsed,
       Predicate<LeafReaderContext> leavesToCache,
       float skipCacheFactor) {
+    this(maxSize, maxRamBytesUsed, leavesToCache, new AtomicReference<>(skipCacheFactor));
+  }
+
+  /**
+   * Additionally, allows the ability to pass skipCacheFactor in form of AtomicReference where the
+   * caller can dynamically update (in a thread safe way) its value by calling skipCacheFactor.set()
+   * on their end.
+   */
+  public LRUQueryCache(
+      int maxSize,
+      long maxRamBytesUsed,
+      Predicate<LeafReaderContext> leavesToCache,
+      AtomicReference<Float> skipCacheFactor) {
```

Review Comment: I actually made it an AtomicReference so that a caller can create it on their end and pass the reference here, through which the caller can change its value dynamically. The other way to set it dynamically would have been a getter/setter like you mentioned below. But then one cannot use the interface `QueryCache` to initialize `LRUQueryCache`, and is forced to use the concrete implementation (to get/set), which I think is not right.

```java
QueryCache lruCache = new LRUQueryCache();
// Cannot access getter/setter unless I typecast or use the concrete implementation directly.
```

Let me know what you think.
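The shared-`AtomicReference` approach can be sketched with a toy stand-in for the cache (the `ToyCache` class and its `shouldCache` check below are hypothetical, not the real `LRUQueryCache` logic): caller and cache hold the same reference, so the caller can retune the factor at any time without ever downcasting from the `QueryCache` interface.

```java
import java.util.concurrent.atomic.AtomicReference;

public class DynamicFactorSketch {
  // Stand-in for LRUQueryCache: it reads the factor on every decision and
  // never caches the value locally, so caller updates take effect immediately.
  static final class ToyCache {
    private final AtomicReference<Float> skipCacheFactor;

    ToyCache(AtomicReference<Float> skipCacheFactor) {
      this.skipCacheFactor = skipCacheFactor;
    }

    // Hypothetical skip-cache test: skip caching when the candidate query
    // is much more expensive than the lead iterator.
    boolean shouldCache(long queryCost, long leadCost) {
      return queryCost <= skipCacheFactor.get() * leadCost;
    }
  }

  public static void main(String[] args) {
    AtomicReference<Float> factor = new AtomicReference<>(10f);
    ToyCache cache = new ToyCache(factor);
    System.out.println(cache.shouldCache(500, 100)); // true: 500 <= 10 * 100
    factor.set(2f); // caller retunes dynamically, thread-safely
    System.out.println(cache.shouldCache(500, 100)); // false: 500 > 2 * 100
  }
}
```

The trade-off discussed in the review is visible here: the caller owns the reference, so no setter needs to appear on the cache type at all, at the cost of a slightly unusual constructor signature.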
[PR] quick exit on filter query matching no docs when rewriting knn query [lucene]
bugmakerr opened a new pull request, #14418: URL: https://github.com/apache/lucene/pull/14418 ### Description
Re: [PR] Fix TestHnswByteVectorGraph.testBuildingJoinSet [lucene]
mayya-sharipova merged PR #14398: URL: https://github.com/apache/lucene/pull/14398
Re: [PR] Preparing existing profiler for adding concurrent profiling [lucene]
jainankitk commented on PR #14413: URL: https://github.com/apache/lucene/pull/14413#issuecomment-2759703236

> Does it make sense to create a separate QueryProfilerBreakDown per leaf? Or should it create one per slice?

Actually, creating one per slice makes a lot of sense.

> Can this be implemented as part of the ProfilerCollectorManager#newCollector logic? Maybe not, since we would also need to profile the work done by the Weight + Scorer on each slice

We can always use the `ThreadId` to correctly track the mapping from slice to `QueryProfilerBreakdown` object.
Re: [PR] Preparing existing profiler for adding concurrent profiling [lucene]
jainankitk commented on PR #14413: URL: https://github.com/apache/lucene/pull/14413#issuecomment-2759709664

> In my opinion, it's more important to handle the concurrent and non-concurrent cases consistently than to save some overhead when searches are not concurrent. I'd really like non-concurrent search to look and feel like a concurrent search with a single slice running on a SameThreadExecutorService as far as profiling is concerned.

Let me try and see if we can maintain a per-slice `QueryProfilerBreakdown` object. With that, both the concurrent and non-concurrent paths would be consistent as well as efficient.
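One way to get a per-slice breakdown along the lines discussed above is a concurrent map from the executing thread to its breakdown, since each slice runs on a single thread; the non-concurrent case then falls out naturally as a single entry. A stdlib-only sketch under that assumption (the `Breakdown` class and method names here are hypothetical, not the PR's actual classes):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class PerSliceProfilerSketch {
  // Hypothetical per-slice timing accumulator.
  static final class Breakdown {
    final AtomicLong nanos = new AtomicLong();

    void add(long n) {
      nanos.addAndGet(n);
    }
  }

  private final Map<Long, Breakdown> perSlice = new ConcurrentHashMap<>();

  // Each slice runs on exactly one thread, so the thread id identifies the
  // slice; computeIfAbsent lazily creates one Breakdown per slice.
  Breakdown current() {
    return perSlice.computeIfAbsent(Thread.currentThread().getId(), id -> new Breakdown());
  }

  // Merge all slice breakdowns when building the profiling response.
  long totalNanos() {
    return perSlice.values().stream().mapToLong(b -> b.nanos.get()).sum();
  }

  public static void main(String[] args) throws InterruptedException {
    PerSliceProfilerSketch profiler = new PerSliceProfilerSketch();
    Runnable slice = () -> profiler.current().add(5);
    Thread t1 = new Thread(slice);
    Thread t2 = new Thread(slice);
    t1.start();
    t2.start();
    t1.join();
    t2.join();
    profiler.current().add(5); // the main thread acts as a third slice
    System.out.println(profiler.totalNanos()); // 15
  }
}
```

A single-threaded search exercises exactly the same code path with one map entry, which is the "non-concurrent search looks like one slice" consistency asked for above.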
Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]
gsmiller commented on code in PR #14273: URL: https://github.com/apache/lucene/pull/14273#discussion_r2017239514

## lucene/core/src/java/org/apache/lucene/search/DISIDocIdStream.java:

```diff
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import org.apache.lucene.util.FixedBitSet;
+
+final class DISIDocIdStream extends DocIdStream {
+
+  private final DocIdSetIterator iterator;
+  private final int to;
```

Review Comment: minor: maybe `max` to be consistent with the other implementations?

## lucene/core/src/java/org/apache/lucene/search/DocIdStream.java:

```diff
@@ -34,12 +33,34 @@ protected DocIdStream() {}
   /**
    * Iterate over doc IDs contained in this stream in order, calling the given {@link
    * CheckedIntConsumer} on them. This is a terminal operation.
    */
-  public abstract void forEach(CheckedIntConsumer<IOException> consumer) throws IOException;
+  public void forEach(CheckedIntConsumer<IOException> consumer) throws IOException {
+    forEach(DocIdSetIterator.NO_MORE_DOCS, consumer);
+  }
+
+  /**
+   * Iterate over doc IDs contained in this doc ID stream up to the given {@code upTo} exclusive,
+   * calling the given {@link CheckedIntConsumer} on them. It is not possible to iterate these doc
+   * IDs again later on.
+   */
+  public abstract void forEach(int upTo, CheckedIntConsumer<IOException> consumer)
+      throws IOException;
 
   /** Count the number of entries in this stream. This is a terminal operation. */
   public int count() throws IOException {
-    int[] count = new int[1];
-    forEach(_ -> count[0]++);
-    return count[0];
+    return count(DocIdSetIterator.NO_MORE_DOCS);
   }
+
+  /**
+   * Count the number of doc IDs in this stream that are below the given {@code upTo}. These doc IDs
+   * may not be consumed again later.
+   */
+  // Note: it's abstract rather than having a default impl that delegates to #forEach because doing
```

Review Comment: Thanks for adding this comment! +1 to the rationale as well.

## lucene/core/src/java/org/apache/lucene/search/DISIDocIdStream.java:

```diff
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import org.apache.lucene.util.FixedBitSet;
+
+final class DISIDocIdStream extends DocIdStream {
+
+  private final DocIdSetIterator iterator;
+  private final int to;
+  private final FixedBitSet spare;
+
+  DISIDocIdStream(DocIdSetIterator iterator, int to, FixedBitSet spare) {
```

Review Comment: Would it make sense to offer another ctor that takes care of allocating `spare`? I suppose there's not actually a use-case for it right now since the one calling code point of this currently has a bit set that can be reused, so maybe not worth adding until there's an actual reason.

## lucene/core/src/java/org/apache/lucene/util/FixedBitSet.java:

```diff
@@ -204,6 +205,39 @@ public int cardinality() {
     return Math.toIntExact(tot);
   }
 
+  /**
+   * Return the number of set bits between indexes {@code from} inclusive and {@code to} exclusive.
+   */
+  public int cardinality(int from, int to) {
+    Objects.checkFromToIndex(from, to, length());
+
+    int cardinality = 0;
+
+    // First, align `from` with a word start, ie. a multiple of Long.SIZE (64)
+    if ((from & 0x3F) != 0) {
+      long bits = this.bits[from >> 6] >>> from;
+      int
```
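The word-aligned counting that the new `FixedBitSet#cardinality(from, to)` begins above can be sketched in full with plain arrays: mask off the partial first and last words, then `Long.bitCount` the whole words in between. This is an independent reimplementation for illustration, not the Lucene code:

```java
public class RangeCardinality {
  // Count set bits in bits[] between index from (inclusive) and to (exclusive).
  static int cardinality(long[] bits, int from, int to) {
    if (from >= to) {
      return 0;
    }
    int startWord = from >> 6;
    int endWord = (to - 1) >> 6;
    // Mask off bits below `from` in the first word... (shift counts are mod 64)
    long startMask = -1L << from;
    // ...and bits at or above `to` in the last word: keeps the low (to & 63)
    // bits, or all 64 when `to` is word-aligned.
    long endMask = -1L >>> -to;
    if (startWord == endWord) {
      return Long.bitCount(bits[startWord] & startMask & endMask);
    }
    int count = Long.bitCount(bits[startWord] & startMask);
    for (int w = startWord + 1; w < endWord; w++) {
      count += Long.bitCount(bits[w]); // whole words need no masking
    }
    count += Long.bitCount(bits[endWord] & endMask);
    return count;
  }

  public static void main(String[] args) {
    long[] bits = new long[3]; // 192 bits
    for (int i = 0; i < 192; i += 3) { // set every third bit
      bits[i >> 6] |= 1L << i;
    }
    System.out.println(cardinality(bits, 0, 192)); // 64
    System.out.println(cardinality(bits, 1, 4));   // 1: only bit 3 is in range
  }
}
```

The masking trick works because Java shift counts are taken modulo 64, so `-1L << from` and `-1L >>> -to` only depend on the in-word offsets of `from` and `to`, exactly the alignment the quoted Lucene code starts handling.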
Re: [I] Leverage sparse doc value indexes for range and value facet collection [lucene]
gsmiller commented on issue #14406: URL: https://github.com/apache/lucene/issues/14406#issuecomment-2759371300

> Out of curiosity, is it common for the union of the configured ranges to only match a small subset of the index? I would naively expect users to want to collect stats about all their data, so there would be one open-ended range as a "case else" and such an optimization would never kick in in practice?

Agreed it seems rare/unlikely but I have one such example of this we currently do in Amazon's product search engine. We support a "Customer Reviews" filter today that allows customers to filter by a product's "star rating", but the only option we currently expose to customers is "4 stars & up" (while star ratings for products run from 1 - 5 stars). We leverage faceting to compute aggregations over products with a review value of `4-`, and don't care about docs with < 4 stars (for the purpose of faceting).

The problem with this example is that we're also faceting on other fields for the same match set, so even if we exposed a competitive iterator for the reviews case, it wouldn't actually let us skip anything in practice (at least not in the sandbox faceting module where we're computing many aggregations while collecting). But I suppose our traditional faceting implementation could leverage a sparse index directly to potentially skip over some blocks of documents. Agreed it's an unlikely optimization to have much value in practice :)
Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]
jpountz commented on code in PR #14273: URL: https://github.com/apache/lucene/pull/14273#discussion_r2017547274

## lucene/core/src/java/org/apache/lucene/search/BitSetDocIdStream.java:

```diff
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import org.apache.lucene.util.FixedBitSet;
+
+final class BitSetDocIdStream extends DocIdStream {
+
+  private final FixedBitSet bitSet;
+  private final int offset, max;
+  private int upTo;
+
+  BitSetDocIdStream(FixedBitSet bitSet, int offset) {
+    this.bitSet = bitSet;
+    this.offset = offset;
+    upTo = offset;
+    max = (int) Math.min(Integer.MAX_VALUE, (long) offset + bitSet.length());
+  }
+
+  @Override
+  public boolean mayHaveRemaining() {
+    return upTo < max;
+  }
+
+  @Override
+  public void forEach(int upTo, CheckedIntConsumer<IOException> consumer) throws IOException {
+    if (upTo >= this.upTo) {
```

Review Comment: You're right, I'll change this.
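The `offset`/`upTo` bookkeeping in `BitSetDocIdStream` implements a one-shot cursor: each `forEach(upTo, ...)` call consumes the remaining doc IDs strictly below `upTo` and remembers where it stopped, so doc IDs are never replayed. A stdlib sketch of that same contract over a sorted `int[]` (this class is hypothetical, not the Lucene one, and skips the bit-set plumbing):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

public class OneShotDocIdStream {
  private final int[] docs; // sorted, strictly increasing doc IDs
  private int index;        // next doc to emit; consumed docs are never replayed

  OneShotDocIdStream(int[] docs) {
    this.docs = docs;
  }

  boolean mayHaveRemaining() {
    return index < docs.length;
  }

  // Emit every remaining doc ID strictly below upTo, advancing the cursor.
  void forEach(int upTo, IntConsumer consumer) {
    while (index < docs.length && docs[index] < upTo) {
      consumer.accept(docs[index++]);
    }
  }

  public static void main(String[] args) {
    OneShotDocIdStream stream = new OneShotDocIdStream(new int[] {3, 7, 12, 40});
    List<Integer> first = new ArrayList<>();
    stream.forEach(10, first::add); // consumes 3 and 7
    List<Integer> rest = new ArrayList<>();
    stream.forEach(Integer.MAX_VALUE, rest::add); // consumes only 12 and 40
    System.out.println(first + " " + rest + " " + stream.mayHaveRemaining());
    // [3, 7] [12, 40] false
  }
}
```

This makes the review point concrete: a second call with a smaller `upTo` must be a no-op rather than rewind, which is why the guard on the stored cursor position matters.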
Re: [PR] Case-insensitive TermInSetQuery Implementation (Proof of Concept) [lucene]
github-actions[bot] commented on PR #14349: URL: https://github.com/apache/lucene/pull/14349#issuecomment-2759877032 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] Reduce the number of comparisons when lowerPoint is equal to upperPoint [lucene]
github-actions[bot] commented on PR #14267: URL: https://github.com/apache/lucene/pull/14267#issuecomment-2759877138 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] Disable the query cache by default. [lucene]
github-actions[bot] commented on PR #14187: URL: https://github.com/apache/lucene/pull/14187#issuecomment-2759877242 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] PointInSetQuery clips segments by lower and upper [lucene]
github-actions[bot] commented on PR #14268: URL: https://github.com/apache/lucene/pull/14268#issuecomment-2759877100 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] Terminate automaton when it can match all suffixes, and match suffixes directly. [lucene]
github-actions[bot] commented on PR #13072: URL: https://github.com/apache/lucene/pull/13072#issuecomment-2759878644 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!