Re: [PR] Add histogram facet capabilities. [lucene]
jpountz commented on PR #14204: URL: https://github.com/apache/lucene/pull/14204#issuecomment-2650473311

I moved the code to the sandbox facet framework and applied suggestions, I hope I didn't miss any.
Re: [PR] Bugfix/fix hnsw search termination check [lucene]
jimczi commented on code in PR #14215: URL: https://github.com/apache/lucene/pull/14215#discussion_r1950877363

lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:

```diff
@@ -52,7 +52,7 @@ public boolean collect(int docId, float similarity) {
   @Override
   public float minCompetitiveSimilarity() {
-    return queue.size() >= k() ? queue.topScore() : Float.NEGATIVE_INFINITY;
+    return queue.size() >= k() ? Math.nextUp(queue.topScore()) : Float.NEGATIVE_INFINITY;
  }
```

Review Comment: It was more about the high-level contract of minCompetitiveSimilarity. It's not a big deal, but it does place the responsibility on the various implementers.
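For context, a tiny standalone illustration of the semantics being debated (hypothetical values, not Lucene code): returning `Math.nextUp(topScore)` turns the bound from inclusive to exclusive, so a candidate that merely ties the current worst queue entry no longer counts as competitive.

```java
// Hypothetical values; not Lucene code. With the inclusive bound, a candidate
// that ties the worst queue entry still looks competitive; nextUp makes it strict.
float topScore = 0.73f;
float inclusiveBound = topScore;
float exclusiveBound = Math.nextUp(topScore); // smallest float > topScore

float tied = 0.73f;
System.out.println(tied >= inclusiveBound); // true:  ties keep the search going
System.out.println(tied >= exclusiveBound); // false: the search may terminate
```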
Re: [PR] Reduce virtual calls when visiting bpv24-encoded doc ids in BKD leaves [lucene]
iverase commented on PR #14176: URL: https://github.com/apache/lucene/pull/14176#issuecomment-2650518843

> The [nightly geo benchy](https://benchmarks.mikemccandless.com/geobench.html) didn't seem impacted either way; maybe the tasks it runs are not exercising the optimized path here.
Re: [PR] Add histogram facet capabilities. [lucene]
epotyom commented on code in PR #14204: URL: https://github.com/apache/lucene/pull/14204#discussion_r1950866681

lucene/facet/src/java/org/apache/lucene/facet/histogram/HistogramCollector.java:

```diff
@@ -0,0 +1,252 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.histogram;
+
+import java.io.IOException;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.DocValuesSkipper;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.NumericDocValues;
+import org.apache.lucene.index.SortedNumericDocValues;
+import org.apache.lucene.internal.hppc.LongIntHashMap;
+import org.apache.lucene.search.CollectionTerminatedException;
+import org.apache.lucene.search.Collector;
+import org.apache.lucene.search.LeafCollector;
+import org.apache.lucene.search.Scorable;
+import org.apache.lucene.search.ScoreMode;
+
+final class HistogramCollector implements Collector {
```

Review Comment:

> the API is designed around the idea of hierarchical facets

I don't think hierarchy is a central piece of the API; it is more like hierarchical facets are also supported if needed. For many facet implementations it is not needed, such as ranges, distinct long values, etc.

I briefly looked at the code and I think it fits into the sandbox module API use case. I believe all you need is to implement the FacetCutter (`#createLeafCutter`) and LeafFacetCutter (`#advanceExact` and `#nextOrd`) interfaces, which return bucket indexes as facet ordinals. What you would get for free is counting, or computing other aggregations based on numeric fields for the doc (FacetRecorder implementations); sorting/topN buckets by count or aggregation (`OrdinalIterator`s). You could also implement the `OrdToLabel` interface to do the math explained in HistogramCollectorManager's javadoc and return String representations of bucket ranges.

That being said, if you don't think that this additional functionality is useful, then there is probably no reason to use the new API, which is more flexible at the cost of creating more abstractions.
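To make the suggested integration concrete, here is a rough sketch of a histogram cutter against the sandbox API. Sketch only: the interface and method names are taken from the comment above, but the package paths, the `NO_MORE_ORDS` sentinel, and the single-valued doc-values assumption are ours, and negative quotients would need an offset to remain valid ordinals.

```java
import java.io.IOException;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.sandbox.facet.cutters.FacetCutter;
import org.apache.lucene.sandbox.facet.cutters.LeafFacetCutter;

class HistogramFacetCutter implements FacetCutter {
  private final String field;
  private final long interval;

  HistogramFacetCutter(String field, long interval) {
    this.field = field;
    this.interval = interval;
  }

  @Override
  public LeafFacetCutter createLeafCutter(LeafReaderContext context) throws IOException {
    NumericDocValues values = DocValues.getNumeric(context.reader(), field);
    return new LeafFacetCutter() {
      private boolean exhausted = true;

      @Override
      public boolean advanceExact(int doc) throws IOException {
        boolean hasValue = values.advanceExact(doc);
        exhausted = hasValue == false;
        return hasValue;
      }

      @Override
      public int nextOrd() throws IOException {
        if (exhausted) {
          return NO_MORE_ORDS; // sentinel assumed to be inherited from OrdinalIterator
        }
        exhausted = true;
        // The bucket index doubles as the facet ordinal; counting, aggregations,
        // and topN then come from FacetRecorder / OrdinalIterator implementations.
        return (int) Math.floorDiv(values.longValue(), interval);
      }
    };
  }
}
```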
Re: [PR] Reduce virtual calls when visiting bpv24-encoded doc ids in BKD leaves [lucene]
mikemccand commented on PR #14176: URL: https://github.com/apache/lucene/pull/14176#issuecomment-2650500796

> Incredible speed ups here https://benchmarks.mikemccandless.com/FilteredIntNRQ.html and here https://benchmarks.mikemccandless.com/IntNRQ.html

Yeah, wow!

> These numbers look great! I want to run this change on the geo benchmarks but I expect similar speed ups. I am planning to do it early next week.

The [nightly geo benchy](https://benchmarks.mikemccandless.com/geobench.html) didn't seem impacted either way; maybe the tasks it runs are not exercising the optimized path here.
Re: [PR] Add UnwrappingReuseStrategy for AnalyzerWrapper [lucene]
mayya-sharipova commented on PR #14154: URL: https://github.com/apache/lucene/pull/14154#issuecomment-2651354310

@jpountz Thank you for your feedback. I am happy to discuss alternatives. Isn't the whole idea of ReuseStrategy to decide if an analyzer's components can be reused or should be recreated/swapped?

> I worry that this is just one place where the assumption that analysis components are immutable is violated, and that there are other ones that haven't been found yet. I'm curious if other alternatives were considered, e.g. atomically swapping the analyzer or updating the behavior of analysis components without needing to replace them.

Yes, indeed: in this PR analysis components are immutable; they are recreated/swapped in an analyzer.

For context: this PR came from a need to have an updateable synonyms analyzer when it is wrapped in another analyzer, for example CompletionAnalyzer. In Elasticsearch, an analyzer with updateable synonyms has a custom ReuseStrategy that creates new components when synonyms are updated. We now want CompletionAnalyzer to also update its components when its wrapped synonyms analyzer has updated synonyms.
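For illustration, a minimal sketch of that pattern against Lucene's `Analyzer.ReuseStrategy` API (the generation supplier is hypothetical; it stands in for whatever signal says synonyms changed, not for the actual Elasticsearch code):

```java
import java.util.function.Supplier;
import org.apache.lucene.analysis.Analyzer;

// Sketch only: rebuilds components whenever the synonyms "generation" changes.
final class InvalidatingReuseStrategy extends Analyzer.ReuseStrategy {
  private final Supplier<Long> synonymsGeneration;
  private volatile long builtGeneration = -1;

  InvalidatingReuseStrategy(Supplier<Long> synonymsGeneration) {
    this.synonymsGeneration = synonymsGeneration;
  }

  @Override
  public Analyzer.TokenStreamComponents getReusableComponents(Analyzer analyzer, String fieldName) {
    if (builtGeneration != synonymsGeneration.get()) {
      return null; // stale components: the analyzer will recreate (swap) them
    }
    return (Analyzer.TokenStreamComponents) getStoredValue(analyzer);
  }

  @Override
  public void setReusableComponents(
      Analyzer analyzer, String fieldName, Analyzer.TokenStreamComponents components) {
    builtGeneration = synonymsGeneration.get();
    setStoredValue(analyzer, components);
  }
}
```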
Re: [I] Make HNSW merges cheaper on heap or at least expose heap usage estimate [lucene]
benwtrent commented on issue #14208: URL: https://github.com/apache/lucene/issues/14208#issuecomment-2650772079

@Vikasht34 I thank you for the helpful suggestions, but these just seem like rehashes of already-made suggestions or things that are completely unrelated. Were these generated via some LLM or something?

One idea I had was that we eagerly start delta encoding var-int connections as they reach a certain size. Maybe we get lucky, and as the graph grows we are less likely to update a graph's connections and thus have to re-calculate and extend the graph connections. This will complicate the diversity checks for connections as it relies on vectors being sorted by nearness :/

Maybe we delta encode their scores instead of the neighbor ordinals (crazy idea, I don't know how to do this)? It's conceivable that the scores for a given NSW will be generally close to each other.
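A toy sketch of the delta + var-int idea (not Lucene's actual graph storage; it just shows why sorted neighbor ordinals make the gaps, and hence the var-ints, small):

```java
import java.io.ByteArrayOutputStream;

// Toy encoder: sorted neighbor ordinals -> gaps -> LEB128-style var-ints.
// Small gaps (dense neighborhoods) compress to a single byte each.
static byte[] encodeNeighbors(int[] sortedOrds, int count) {
  ByteArrayOutputStream out = new ByteArrayOutputStream();
  int prev = 0;
  for (int i = 0; i < count; i++) {
    int delta = sortedOrds[i] - prev; // >= 0 because the input is sorted
    prev = sortedOrds[i];
    while ((delta & ~0x7F) != 0) { // emit 7 bits at a time, continuation bit set
      out.write((delta & 0x7F) | 0x80);
      delta >>>= 7;
    }
    out.write(delta);
  }
  return out.toByteArray();
}
```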
Re: [PR] Bugfix/fix hnsw search termination check [lucene]
benwtrent commented on code in PR #14215: URL: https://github.com/apache/lucene/pull/14215#discussion_r1950821830

lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:

```diff
@@ -52,7 +52,7 @@ public boolean collect(int docId, float similarity) {
   @Override
   public float minCompetitiveSimilarity() {
-    return queue.size() >= k() ? queue.topScore() : Float.NEGATIVE_INFINITY;
+    return queue.size() >= k() ? Math.nextUp(queue.topScore()) : Float.NEGATIVE_INFINITY;
  }
```

Review Comment: I am fine with that as well. I suppose the only thing that actually uses this is the graph searcher :/
Re: [I] Make HNSW merges cheaper on heap or at least expose heap usage estimate [lucene]
Vikasht34 commented on issue #14208: URL: https://github.com/apache/lucene/issues/14208#issuecomment-2651710710

@benwtrent Thanks ...
Re: [PR] Clean up public non-final statics [lucene]
msokolov commented on PR #14221: URL: https://github.com/apache/lucene/pull/14221#issuecomment-2652083206

LGTM, thanks!
Re: [PR] Add UnwrappingReuseStrategy for AnalyzerWrapper [lucene]
jpountz commented on PR #14154: URL: https://github.com/apache/lucene/pull/14154#issuecomment-2652082591

I suspect that `ReuseStrategy` hasn't been built with the idea that it could be used to invalidate components, but only as a way to describe an efficient caching strategy for analysis components, typically whether analysis components can be cached globally (`GLOBAL_REUSE_STRATEGY`) or need a different cache entry for each field (`PER_FIELD_REUSE_STRATEGY`). @uschindler I believe that you are the one who designed this logic; I'd be interested in your thoughts.
Re: [PR] Clean up public non-final statics [lucene]
risdenk commented on code in PR #14221: URL: https://github.com/apache/lucene/pull/14221#discussion_r1951627505

lucene/analysis/smartcn/src/java/org/apache/lucene/analysis/cn/smart/AnalyzerProfile.java:

```diff
@@ -34,45 +34,44 @@ public class AnalyzerProfile {
 
   /** Global indicating the configured analysis data directory */
-  public static String ANALYSIS_DATA_DIR = "";
+  public static final String ANALYSIS_DATA_DIR;
 
   static {
-    init();
+    ANALYSIS_DATA_DIR = init();
```

Review Comment: Does this need to be in the static block? Can't the line above just be `public static final String ANALYSIS_DATA_DIR = init();`?
Re: [PR] Clean up public non-final statics [lucene]
risdenk commented on PR #14221: URL: https://github.com/apache/lucene/pull/14221#issuecomment-2652103797

https://github.com/apache/lucene/pull/14216 may help with this, specifically `NonFinalStaticField`, which was off since it was super noisy: https://errorprone.info/bugpattern/NonFinalStaticField
Re: [PR] Consistent KNN query results with multiple leafs [lucene]
benwtrent commented on code in PR #14191: URL: https://github.com/apache/lucene/pull/14191#discussion_r1939664144

lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:

```diff
@@ -19,11 +19,7 @@
 import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
 
 import java.io.IOException;
-import java.util.ArrayList;
-import java.util.Arrays;
-import java.util.Comparator;
-import java.util.List;
-import java.util.Objects;
+import java.util.*;
```

Review Comment: let's not use `*`

lucene/core/src/java/org/apache/lucene/search/knn/TopKnnCollectorManager.java:

```diff
@@ -55,8 +57,13 @@ public KnnCollector newCollector(
     if (globalScoreQueue == null) {
       return new TopKnnCollector(k, visitedLimit, searchStrategy);
     } else {
-      return new MultiLeafKnnCollector(
-          k, globalScoreQueue, new TopKnnCollector(k, visitedLimit, searchStrategy));
+      if (freeze.getAndSet(false)) {
+        return new MultiLeafKnnCollector(
+            k, globalScoreQueue, new TopKnnCollector(k, visitedLimit, searchStrategy), false);
+      } else {
```

Review Comment: I don't think you need to do this "freeze" thing. The only issue was sharing information between threads. A single executor continuing to share information is completely OK.

Review Comment: For this change, we simply make `MultiLeafKnnCollector` not thread-safe at all. We can continue to have this `globalScoreQueue` that is unique per thread and shared between all the collectors, but all the code assumes it's within a single thread. This will likely speed things up significantly.

lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:

```diff
@@ -88,11 +91,64 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException {
             getKnnCollectorManager(k, indexSearcher),
             indexSearcher.getTimeout());
     TaskExecutor taskExecutor = indexSearcher.getTaskExecutor();
     List<LeafReaderContext> leafReaderContexts = reader.leaves();
+    List<Callable<TopDocs>> tasks = new ArrayList<>(leafReaderContexts.size());
-    for (LeafReaderContext context : leafReaderContexts) {
-      tasks.add(() -> searchLeaf(context, filterWeight, knnCollectorManager));
+
+    TopDocs[] perLeafResults;
+    if (leafReaderContexts.size() > 1) {
+      if (true) {
+        /* sort LRCs by segment size */
+        List<LeafReaderContext> sortedLeafReaderContexts = leafReaderContexts.stream()
+            .sorted(Comparator.comparingInt(o -> o.reader().numDocs())).toList();
+        int noLRCs = sortedLeafReaderContexts.size();
```

Review Comment: I think this is OK for a POC, but we can easily sort by the actual number of vectors for real data.
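A sketch of what that last suggestion could look like (illustrative only; `field` is assumed to be the query's vector field, and byte vectors would need the equivalent treatment):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Comparator;
import java.util.List;
import org.apache.lucene.index.FloatVectorValues;
import org.apache.lucene.index.LeafReaderContext;

// Orders leaves by the true vector cardinality of the field rather than
// by reader().numDocs().
static List<LeafReaderContext> sortByVectorCount(List<LeafReaderContext> leaves, String field) {
  return leaves.stream()
      .sorted(
          Comparator.comparingInt(
              ctx -> {
                try {
                  FloatVectorValues values = ctx.reader().getFloatVectorValues(field);
                  return values == null ? 0 : values.size();
                } catch (IOException e) {
                  throw new UncheckedIOException(e);
                }
              }))
      .toList();
}
```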
Re: [PR] Clean up public non-final statics [lucene]
msfroh commented on code in PR #14221: URL: https://github.com/apache/lucene/pull/14221#discussion_r1951677461

lucene/analysis/smartcn/src/java/org/apache/lucene/analysis/cn/smart/AnalyzerProfile.java:

```diff
@@ -34,45 +34,44 @@ public class AnalyzerProfile {
 
   /** Global indicating the configured analysis data directory */
-  public static String ANALYSIS_DATA_DIR = "";
+  public static final String ANALYSIS_DATA_DIR;
 
   static {
-    init();
+    ANALYSIS_DATA_DIR = init();
```

Review Comment: You are absolutely right! Thanks!
Re: [I] org.apache.lucene.search.TestKnnFloatVectorQuery.testFindFewer ComparisonFailure: expected: but was: [lucene]
navneet1v commented on issue #14175: URL: https://github.com/apache/lucene/issues/14175#issuecomment-2650199394

@ChrisHegarty I did some debugging on this failed test, and what I found is that for the particular seed, the Lucene99ScalarQuantizedVectorsFormat gets picked from a list of formats which gets created here: https://github.com/apache/lucene/blob/6b0112cdee284cae9143402d7dd80fdaf4d22f93/lucene/test-framework/src/java/org/apache/lucene/tests/index/RandomCodec.java#L267

Now, due to this quantized format, the scores of the 0th doc and the 2nd doc become the same, causing the failure, as it is not guaranteed which doc will come first. Code ref: https://github.com/apache/lucene/blob/6b0112cdee284cae9143402d7dd80fdaf4d22f93/lucene/test-framework/src/java/org/apache/lucene/tests/index/RandomCodec.java#L210-L219

One way I think we can fix this: rather than checking the order of docs, we should check whether all the docs we expect are present.
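One way the suggested fix could look in test code (doc ids here are made up; the point is asserting set membership rather than order):

```java
import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;
import org.apache.lucene.search.TopDocs;

// "results" stands for the TopDocs under test; equal-scoring docs may
// tie-break either way, so only the set of hits is asserted.
static void assertSameDocs(Set<Integer> expected, TopDocs results) {
  Set<Integer> actual =
      Arrays.stream(results.scoreDocs).map(sd -> sd.doc).collect(Collectors.toSet());
  assertEquals(expected, actual);
}
```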
Re: [PR] Unicode Support for Case Insensitive Matching in RegExp [lucene]
john-wagster commented on PR #14192: URL: https://github.com/apache/lucene/pull/14192#issuecomment-2651490994

@rmuir I made another pass based on your feedback and I'm good with, and agree with, keeping this simple for a first pass. To that end I've done the following:

* CaseFolding is no longer externally exposed (can revisit this later)
* CaseFolding has one method to get the set of alternates, renamed from `fold`, which was inaccurate previously. The idea is to introduce a `fold` method in the future if it's valuable.
* CaseFolding only includes the set of edge cases that are not handled by `Character.toLowerCase` and `Character.toUpperCase`, and is a switch/case statement now with hex codepoints and comments on every character (the sketch after this list illustrates the rough shape)
  * this is similar to ASCIIFoldingFilter
  * it was built using https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt and filtered to the set of things not already covered by `Character.toLowerCase` and `Character.toUpperCase`
* The utility class CaseFoldingUtil for generating the switch/case statement code in the CaseFolding class has been removed.
* The ASCII_CASE_INSENSITIVE flag has been deprecated and has the same behavior as the CASE_INSENSITIVE flag
* Tried to minimize changes otherwise, with less concern for performance during compilation of the regex

Thoughts at this point, and whether I've adequately incorporated your feedback, would be much appreciated.
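A rough illustration of that switch/case shape (sketch only: the three codepoints here are well-known folding edge cases chosen for illustration; the real table generated from CaseFolding.txt is much larger):

```java
// Returns extra codepoints that fold together with the given one but that
// Character.toLowerCase/Character.toUpperCase alone would not discover.
static int[] getAlternates(int codepoint) {
  return switch (codepoint) {
    case 0x0073 -> new int[] {0x017F}; // 's' also folds with LATIN SMALL LETTER LONG S
    case 0x006B -> new int[] {0x212A}; // 'k' also folds with KELVIN SIGN
    case 0x03C3 -> new int[] {0x03C2}; // GREEK SMALL SIGMA also folds with FINAL SIGMA
    default -> null; // the simple toLowerCase/toUpperCase pair is sufficient
  };
}
```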
Re: [PR] Add histogram facet capabilities. [lucene]
gsmiller commented on code in PR #14204: URL: https://github.com/apache/lucene/pull/14204#discussion_r1951323035

lucene/facet/src/java/org/apache/lucene/facet/histogram/HistogramCollector.java:

```diff
@@ -0,0 +1,252 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.histogram;
+
+import java.io.IOException;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.DocValuesSkipper;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.NumericDocValues;
+import org.apache.lucene.index.SortedNumericDocValues;
+import org.apache.lucene.internal.hppc.LongIntHashMap;
+import org.apache.lucene.search.CollectionTerminatedException;
+import org.apache.lucene.search.Collector;
+import org.apache.lucene.search.LeafCollector;
+import org.apache.lucene.search.Scorable;
+import org.apache.lucene.search.ScoreMode;
+
+final class HistogramCollector implements Collector {
+
+  private final String field;
+  private final long interval;
+  private final LongIntHashMap counts;
+
+  HistogramCollector(String field, long interval) {
+    this.field = field;
+    this.interval = interval;
+    this.counts = new LongIntHashMap();
+  }
+
+  @Override
+  public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
+    FieldInfo fi = context.reader().getFieldInfos().fieldInfo(field);
+    if (fi == null) {
+      // The segment has no values, nothing to do.
+      throw new CollectionTerminatedException();
+    }
+    if (fi.getDocValuesType() != DocValuesType.NUMERIC
+        && fi.getDocValuesType() != DocValuesType.SORTED_NUMERIC) {
+      throw new IllegalStateException(
+          "Expected numeric field, but got doc-value type: " + fi.getDocValuesType());
+    }
+    SortedNumericDocValues values = DocValues.getSortedNumeric(context.reader(), field);
+    NumericDocValues singleton = DocValues.unwrapSingleton(values);
+    if (singleton == null) {
+      return new HistogramNaiveLeafCollector(values, interval, counts);
+    } else {
+      DocValuesSkipper skipper = context.reader().getDocValuesSkipper(field);
+      if (skipper != null) {
+        long leafMinQuotient = Math.floorDiv(skipper.minValue(), interval);
+        long leafMaxQuotient = Math.floorDiv(skipper.maxValue(), interval);
+        if (leafMaxQuotient - leafMinQuotient <= 1024) {
+          // Only use the optimized implementation if there is a small number of unique quotients,
+          // so that we can count them using a dense array instead of a hash table.
+          return new HistogramLeafCollector(singleton, skipper, interval, counts);
+        }
+      }
+      return new HistogramNaiveSingleValuedLeafCollector(singleton, interval, counts);
+    }
+  }
+
+  @Override
+  public ScoreMode scoreMode() {
+    return ScoreMode.COMPLETE_NO_SCORES;
+  }
+
+  LongIntHashMap getCounts() {
+    return counts;
+  }
+
+  /**
+   * Naive implementation of a histogram {@link LeafCollector}, which iterates all matches and looks
+   * up the value to determine the corresponding bucket.
+   */
+  private static class HistogramNaiveLeafCollector implements LeafCollector {
+
+    private final SortedNumericDocValues values;
+    private final long interval;
+    private final LongIntHashMap counts;
+
+    HistogramNaiveLeafCollector(
+        SortedNumericDocValues values, long interval, LongIntHashMap counts) {
+      this.values = values;
+      this.interval = interval;
+      this.counts = counts;
+    }
+
+    @Override
+    public void setScorer(Scorable scorer) throws IOException {}
+
+    @Override
+    public void collect(int doc) throws IOException {
+      if (values.advanceExact(doc)) {
+        int valueCount = values.docValueCount();
+        long prevQuotient = Long.MIN_VALUE;
+        for (int i = 0; i < valueCount; ++i) {
+          final long value = values.nextValue();
+          final long quotient = Math.floorDiv(value, interval);
+          // We must not double-count values that divide to the same quotient since this returns doc
+          // counts as opposed to value counts
```
[I] NRT replication should make it possible/easy to use bite-sized commits [lucene]
mikemccand opened a new issue, #14219: URL: https://github.com/apache/lucene/issues/14219

### Description

At Amazon (product search) we use Lucene's [awesome near-real-time segment replication](https://blog.mikemccandless.com/2017/09/lucenes-near-real-time-segment-index.html) to efficiently distribute index changes (through S3) to searchers. This "index once, search many" approach works very well for applications needing high QPS scale. It enables physical isolation of indexing and searching (so e.g. heavy merges cannot affect searching). It enables rapid failover or proactive cluster load-balancing if an indexer crashes or its node is too hot: just promote one of the replicas. And it means you should front-load lots of indexing cost if it makes searching even a bit faster.

But recently we've been struggling with too-large checkpoints. We ask the indexer to write a new checkpoint (an `IndexWriter` commit call) every 60 seconds, and searchers copy down the checkpoint and light the new segments. During periods of heavy index updates ("update storms"), combined with our very aggressive `TieredMergePolicy` configuration to reclaim deletes, we see a big write amplification (bytes copied to replicas vs bytes written for newly indexed documents), sometimes sending many 10s of GB of new segments in a single checkpoint.

When replicas copy these large checkpoints, it can induce heavy page faults on the hot query path for in-flight queries (we suspect the [`MADV_RANDOM` hint for KNN files also exacerbates things](https://github.com/apache/lucene/pull/13267) for us -- this is good for cold indices, but maybe not mostly hot ones?), since the hot searching pages evicted by copying in the large checkpoint before any of the new segments are lit put RAM pressure on the OS. We could maybe tune the OS to more aggressively move dirty pages to disk? Or maybe try `O_DIRECT` when copying the new checkpoint files. But still, when we then light the new segments, we'll hit page faults then too.

We had an a-ha moment on how to fix this, using APIs Lucene already exposes! We just need to decouple committing from checkpointing/replicating. Instead of committing/replicating every 60 seconds, ask Lucene to commit much more frequently (say once per second, like the OpenSearch/Elasticsearch default I think, or maybe "whenever > N GB of segments turn over", though this is harder). Configure a time-based `IndexDeletionPolicy` so these commit points all live for a long time (say an hour). Then, every 60 seconds (or whatever your replication interval is), replicate all new commit points (and any segment files referenced by these commit points) out to searchers. The searchers can then carefully pick and choose which commit points they want to switch to, in a bite-sized / stepping-stone manner, ensuring that each commit point they light has < N GB turnover in its segments, meaning the OS will only ever need "hot pages plus N" GB of working RAM.

This leans nicely on [Lucene's strongly transactional APIs](https://blog.mikemccandless.com/2012/03/transactional-lucene.html), and I think it's largely sugar / utility classes in NRT replicator that we'd need to add to demonstrate this approach, maybe.
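A minimal sketch of the time-based deletion-policy piece (assumptions: a `timestamp` entry in the commit user data, written by the indexer via `IndexWriter#setLiveCommitData`; everything else about retention is elided):

```java
import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexDeletionPolicy;

// Keeps every commit younger than the retention window so replicas can step
// through them one bite-sized checkpoint at a time.
final class TimeBasedDeletionPolicy extends IndexDeletionPolicy {
  private final long retentionMillis;

  TimeBasedDeletionPolicy(long retentionMillis) {
    this.retentionMillis = retentionMillis;
  }

  @Override
  public void onInit(List<? extends IndexCommit> commits) throws IOException {
    onCommit(commits);
  }

  @Override
  public void onCommit(List<? extends IndexCommit> commits) throws IOException {
    long cutoff = System.currentTimeMillis() - retentionMillis;
    // Always keep the most recent commit; drop older ones past the window.
    for (int i = 0; i < commits.size() - 1; i++) {
      IndexCommit commit = commits.get(i);
      String timestamp = commit.getUserData().get("timestamp");
      if (timestamp == null || Long.parseLong(timestamp) < cutoff) {
        commit.delete();
      }
    }
  }
}
```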
Re: [PR] Use Vector API to decode BKD docIds [lucene]
gf2121 commented on PR #14203: URL: https://github.com/apache/lucene/pull/14203#issuecomment-2651208564

Thanks for the feedback! I implemented the fixed-size inner loop and printed out the assembly for all variants: [perf_asm.log](https://github.com/user-attachments/files/18752147/perf_asm.log)

* When profiling is enabled, `#current countVariable=true` and `#current countVariable=false` have the same (slow) speed. It seems like profiling prevented some optimization.
* According to the assembly, `#current bpv=16` does not get auto-vectorized. `#current bpv=24` gets vectorized on the shift loop, but not on the remainder loop.
* According to the assembly, the inner loop gets auto-vectorized, but it is slower than the Vector API.

**MAC M2**

```
Benchmark                        (bpv)  (countVariable)   Mode  Cnt    Score    Error   Units
BKDCodecBenchmark.current           16             true  thrpt    5  103.490 ±  6.785  ops/ms
BKDCodecBenchmark.current           16            false  thrpt    5  212.488 ±  5.383  ops/ms
BKDCodecBenchmark.current           24             true  thrpt    5   91.203 ±  1.023  ops/ms
BKDCodecBenchmark.current           24            false  thrpt    5  149.742 ±  1.953  ops/ms
BKDCodecBenchmark.currentVector     16             true  thrpt    5  213.162 ±  1.598  ops/ms
BKDCodecBenchmark.currentVector     16            false  thrpt    5  216.529 ±  2.518  ops/ms
BKDCodecBenchmark.currentVector     24             true  thrpt    5  153.970 ±  1.101  ops/ms
BKDCodecBenchmark.currentVector     24            false  thrpt    5  140.103 ±  3.001  ops/ms
BKDCodecBenchmark.innerLoop         16             true  thrpt    5  129.281 ±  0.471  ops/ms
BKDCodecBenchmark.innerLoop         16            false  thrpt    5  131.083 ±  8.775  ops/ms
BKDCodecBenchmark.innerLoop         24             true  thrpt    5   99.597 ±  2.850  ops/ms
BKDCodecBenchmark.innerLoop         24            false  thrpt    5   96.235 ± 14.875  ops/ms
BKDCodecBenchmark.legacy            16             true  thrpt    5  104.314 ±  0.557  ops/ms
BKDCodecBenchmark.legacy            16            false  thrpt    5  202.175 ± 10.863  ops/ms
BKDCodecBenchmark.legacy            24             true  thrpt    5   86.016 ±  1.315  ops/ms
BKDCodecBenchmark.legacy            24            false  thrpt    5   85.609 ±  5.733  ops/ms
```

**Linux X86 AVX512 profiling disabled**

```
Benchmark                        (bpv)  (countVariable)   Mode  Cnt    Score    Error   Units
BKDCodecBenchmark.current           16             true  thrpt    5   41.138 ±  1.770  ops/ms
BKDCodecBenchmark.current           16            false  thrpt    5  142.277 ±  0.943  ops/ms
BKDCodecBenchmark.current           24             true  thrpt    5   43.104 ±  0.066  ops/ms
BKDCodecBenchmark.current           24            false  thrpt    5   42.760 ±  0.496  ops/ms
BKDCodecBenchmark.currentVector     16             true  thrpt    5   86.565 ±  0.904  ops/ms
BKDCodecBenchmark.currentVector     16            false  thrpt    5   86.624 ±  0.395  ops/ms
BKDCodecBenchmark.currentVector     24             true  thrpt    5   80.064 ±  2.604  ops/ms
BKDCodecBenchmark.currentVector     24            false  thrpt    5   76.638 ± 18.692  ops/ms
BKDCodecBenchmark.innerLoop         16             true  thrpt    5   43.810 ±  1.096  ops/ms
BKDCodecBenchmark.innerLoop         16            false  thrpt    5   42.485 ±  0.073  ops/ms
BKDCodecBenchmark.innerLoop         24             true  thrpt    5   37.255 ±  0.994  ops/ms
BKDCodecBenchmark.innerLoop         24            false  thrpt    5   37.243 ±  0.593  ops/ms
BKDCodecBenchmark.legacy            16             true  thrpt    5   41.415 ±  0.079  ops/ms
BKDCodecBenchmark.legacy            16            false  thrpt    5  145.713 ±  0.381  ops/ms
BKDCodecBenchmark.legacy            24             true  thrpt    5   27.758 ±  4.210  ops/ms
BKDCodecBenchmark.legacy            24            false  thrpt    5   28.519 ±  1.839  ops/ms
```

**Linux X86 AVX512 profiling enabled**

```
Benchmark                        (bpv)  (countVariable)   Mode  Cnt    Score    Error   Units
BKDCodecBenchmark.current           16             true  thrpt    5   29.878 ±  0.130  ops/ms
BKDCodecBenchmark.current:asm       16             true  thrpt         NaN              ---
BKDCodecBenchmark.current           16            false  thrpt    5   29.314 ±  0.229  ops/ms
BKDCodecBenchmark.current:asm       16            false  thrpt         NaN              ---
BKDCodecBenchmark.current           24             true  thrpt    5   34.874 ±  0.320  ops/ms
BKDCodecBenchmark.current:asm       24             true  thrpt         NaN              ---
BKDCodecBenchmark.current           24            false  thrpt    5   33.987 ±  0.055
```
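For readers without the benchmark source at hand, a sketch of what a "fixed-size inner loop" decode can look like (the bpv=16 layout and the block size are assumptions for illustration; the constant trip count is what gives the JIT a realistic shot at auto-vectorizing):

```java
// Two 16-bit values per packed int, high half first; assumes count is even.
static void decode16(int[] packed, int[] out, int count) {
  final int half = count / 2;
  int i = 0;
  for (; i + 64 <= half; i += 64) {
    for (int j = 0; j < 64; j++) { // fixed trip count: auto-vectorization candidate
      final int l = packed[i + j];
      out[i + j] = l >>> 16;
      out[half + i + j] = l & 0xFFFF;
    }
  }
  for (; i < half; i++) { // variable-length tail
    final int l = packed[i];
    out[i] = l >>> 16;
    out[half + i] = l & 0xFFFF;
  }
}
```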
Re: [PR] Bugfix/fix hnsw search termination check [lucene]
tteofili commented on code in PR #14215: URL: https://github.com/apache/lucene/pull/14215#discussion_r1951068471

lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:

```diff
@@ -52,7 +52,7 @@ public boolean collect(int docId, float similarity) {
   @Override
   public float minCompetitiveSimilarity() {
-    return queue.size() >= k() ? queue.topScore() : Float.NEGATIVE_INFINITY;
+    return queue.size() >= k() ? Math.nextUp(queue.topScore()) : Float.NEGATIVE_INFINITY;
  }
```

Review Comment:

> I am fine with that as well. I suppose the only thing that actually uses this is the graph searcher :/

We now possibly have multiple classes for graph searchers (extensions of `AbstractHnswGraphSearcher`), but that's much better than going over all of the `KnnCollector`s.
Re: [PR] Bugfix/fix hnsw search termination check [lucene]
tteofili commented on PR #14215: URL: https://github.com/apache/lucene/pull/14215#issuecomment-2651170577

Only, perhaps it'd be nice if we could add an @Monster / nightly test to make sure we don't run into this again in the future.
Re: [I] Make HNSW merges cheaper on heap or at least expose heap usage estimate [lucene]
Vikasht34 commented on issue #14208: URL: https://github.com/apache/lucene/issues/14208#issuecomment-2651658228

@benwtrent haha, it does look like it's from an LLM. Originally it came from this paper: https://www.arxiv.org/pdf/1908.00814v5

```
The upper layer graph is merged into the approximate k-NN graph of the next layer after each round of J-Merge. During the construction process, one is allowed to save an arbitrary number of layers to form the hierarchy.
```
Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]
github-actions[bot] commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2652354185

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] Binary vector format for flat and hnsw vectors [lucene]
tveasey commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2652753715

This pull request relates only to OSQ, and thus the proper scope of discussion is the concerns raised around its attribution. We have pursued multiple conversations and discussions in order to resolve various concerns amicably, and we continue to desire collaboration and open discourse with respect to each party's innovations in this problem space. We would like to reiterate that we highly rate the research in both RaBitQ and its extension. However, we also maintain, as we have done in private communications, that OSQ is not a derivative of extended RaBitQ.

> In addition, Elastic shared a doc with us explaining the differences between OSQ and extended RaBitQ, but our understanding is that the core ideas of two methods, i.e., trying different parameters for scalar quantization on a per-vector basis, are similar.

This goes to the crux of the disagreement around attribution. In the following we describe the two schemes, extended RaBitQ [1] and OSQ [2], and state again the points we have made privately regarding their differences.

Both methods centre the data, which was originally proposed in LVQ [3] to the best of our knowledge.

Extended RaBitQ proposes a method for constructing a codebook for which one can compute the dot product efficiently using integer arithmetic. In a nutshell, it proceeds as follows:

1. Apply a random orthogonal transformation to the input vectors.
2. Normalise the resulting vectors and perform an efficient search for the nearest codebook centroid.
3. Snap the floating point vector to this codebook centroid and then undo the normalization procedure to obtain an estimate of the original dot product.

At its heart it is a variant of product quantization which allows for efficient similarity calculations, and it was positioned as such in the original paper.

The key ingredients of OSQ are as follows:

1. Construct per-vector quantization intervals, inherited from LVQ.
2. Formulate quantization as a hyperparameter optimization problem for the interval [a, b] used to construct the grid of possible quantized vectors, building on our earlier work, which we published in April [4].
3. Perform an analysis to show how to choose [a, b] to minimize the expected square error in the dot product when the data are normally distributed, using sufficient statistics of the vector component distribution. (We perform an empirical study of the distribution of vector components in embedding models and show they are typically normally distributed.)
4. Perform an iterative procedure based on coordinate descent which optimizes an anisotropic loss based on the ideas in [5] w.r.t. the interval [a, b], initialising with the output of 3.

At an algorithmic level there is no similarity between the methods other than that they both centre the data. At a conceptual level, RaBitQ was developed as a variant of product quantization, and OSQ was developed as an approach to hyperparameter optimization for per-vector scalar quantization. For the record, we were inspired to revisit scalar quantization in light of the success of binary RaBitQ, and we were exploring ideas prior to any communication between our teams regarding extended RaBitQ. The actual inspiration for the approach was our previous work on hyperparameter optimisation for quantization intervals and LVQ, and we attribute this PR accordingly.
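To make points 2 and 3 above concrete, the interval optimisation can be written roughly as follows (our notation, paraphrasing the description above rather than quoting either paper):

```latex
% Uniform B-bit grid on [a, b]: component x_i snaps to
\hat{x}_i(a,b) = a + \frac{b-a}{2^B-1}\,
    \mathrm{round}\!\left((2^B-1)\,\frac{\mathrm{clamp}(x_i, a, b) - a}{b-a}\right)

% OSQ treats [a, b] as per-vector hyperparameters, chosen to minimise a
% reconstruction loss, e.g. the expected squared error (or an anisotropic
% variant that weights error along the dot-product direction more heavily):
(a^\ast, b^\ast) = \operatorname*{arg\,min}_{a \le b}\;
    \sum_{i=1}^{d} \left(x_i - \hat{x}_i(a,b)\right)^2
```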
Retrospectively, and in parallel to our work, the authors of RaBitQ explored the relationship between extended RaBitQ and scalar quantization directly ([6], published 12/12/2024). They show that the act of normalising the vectors, finding the nearest centroid, then undoing the normalisation can be thought of as a process of rescaling the quantization grid so that the nearest quantized vector winds up closer to the original floating point vector. Whilst this bears some relation to optimising b - a, they never formulate this process as an optimisation problem and at no point show exactly what quantity they in fact optimise. In our view, this misses the key point that the important idea is the formulation. As such, it is very clear that they do not and could not introduce any different optimisation objective, such as anisotropic loss in the dot product. Even if they did, OSQ optimises both a and b directly, which gives it an extra degree of freedom, and develops an effective, non-trivial and completely distinct solution for this problem.

As we have communicated privately, we dispute that OSQ is a variant of extended RaBitQ. We laid out in broadly similar terms exactly why we disagree at a technical level (along the lines of the discussion in this comment). We have received no further technical engagement with the points we have raised and have therefore reached an impasse. Regarding this point
Re: [PR] Add a Faiss codec for KNN searches [lucene]
navneet1v commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r1952041592

lucene/sandbox/src/java22/org/apache/lucene/sandbox/codecs/faiss/LibFaissC.java:

```diff
@@ -0,0 +1,457 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.codecs.faiss;
+
+import static java.lang.foreign.ValueLayout.ADDRESS;
+import static java.lang.foreign.ValueLayout.JAVA_FLOAT;
+import static java.lang.foreign.ValueLayout.JAVA_INT;
+import static java.lang.foreign.ValueLayout.JAVA_LONG;
+
+import java.io.IOException;
+import java.lang.foreign.Arena;
+import java.lang.foreign.FunctionDescriptor;
+import java.lang.foreign.Linker;
+import java.lang.foreign.MemoryLayout;
+import java.lang.foreign.MemorySegment;
+import java.lang.foreign.SymbolLookup;
+import java.lang.invoke.MethodHandle;
+import java.lang.invoke.MethodHandles;
+import java.lang.invoke.MethodType;
+import java.nio.ByteOrder;
+import java.nio.FloatBuffer;
+import java.nio.LongBuffer;
+import java.util.Locale;
+import org.apache.lucene.index.FloatVectorValues;
+import org.apache.lucene.index.KnnVectorValues;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.search.KnnCollector;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.FixedBitSet;
+import org.apache.lucene.util.hnsw.IntToIntFunction;
+
+public final class LibFaissC {
+  /*
+   * TODO: Requires some changes to Faiss
+   * - https://github.com/facebookresearch/faiss/pull/4158 (merged in main, to be released in v1.11.0)
+   * - https://github.com/facebookresearch/faiss/pull/4167 (merged in main, to be released in v1.11.0)
+   * - https://github.com/facebookresearch/faiss/pull/4180 (in progress)
+   */
+
+  public static final String LIBRARY_NAME = "faiss_c";
+  public static final String LIBRARY_VERSION = "1.10.0";
+
+  static {
+    try {
+      System.loadLibrary(LIBRARY_NAME);
+    } catch (UnsatisfiedLinkError e) {
+      throw new RuntimeException(
+          "Shared library not found, build the Faiss C_API from https://github.com/facebookresearch/faiss/blob/main/c_api/INSTALL.md "
+              + "and link it (along with all dependencies) to the library path "
+              + "(-Djava.library.path JVM argument or $LD_LIBRARY_PATH environment variable)",
+          e);
+    }
+    checkLibraryVersion();
+  }
+
+  private LibFaissC() {}
+
+  @SuppressWarnings("SameParameterValue")
+  private static MemorySegment getUpcallStub(
+      Arena arena, MethodHandle target, MemoryLayout resLayout, MemoryLayout... argLayouts) {
+    return Linker.nativeLinker()
+        .upcallStub(target, FunctionDescriptor.of(resLayout, argLayouts), arena);
+  }
+
+  private static MethodHandle getDowncallHandle(
+      String functionName, MemoryLayout resLayout, MemoryLayout... argLayouts) {
+    return Linker.nativeLinker()
+        .downcallHandle(
+            SymbolLookup.loaderLookup().find(functionName).orElseThrow(),
+            FunctionDescriptor.of(resLayout, argLayouts));
+  }
+
+  private static void checkLibraryVersion() {
+    MethodHandle getVersion = getDowncallHandle("faiss_get_version", ADDRESS);
+    String actualVersion = callAndGetString(getVersion);
+    if (LIBRARY_VERSION.equals(actualVersion) == false) {
+      throw new UnsupportedOperationException(
+          String.format(
+              Locale.ROOT,
+              "Expected Faiss library version %s, found %s",
+              LIBRARY_VERSION,
+              actualVersion));
+    }
+  }
+
+  private static final MethodHandle FREE_INDEX =
+      getDowncallHandle("faiss_Index_free", JAVA_INT, ADDRESS);
+
+  public static void freeIndex(MemorySegment indexPointer) {
+    callAndHandleError(FREE_INDEX, indexPointer);
+  }
+
+  private static final MethodHandle FREE_CUSTOM_IO_WRITER =
+      getDowncallHandle("faiss_CustomIOWriter_free", JAVA_INT, ADDRESS);
+
+  public static void freeCustomIOWriter(MemorySegment customIOWriterPointer) {
+    callAndHandleError(FREE_CUSTOM_IO_WRITER, customIOWriterPointer);
+  }
+
+  private static final MethodHandle FREE_CUST
```
Re: [PR] Add a Faiss codec for KNN searches [lucene]
navneet1v commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r1952038874

(quotes the same truncated `LibFaissC.java` context as the comment above)
Re: [PR] Add a Faiss codec for KNN searches [lucene]
navneet1v commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r1952063404

## lucene/sandbox/src/java22/org/apache/lucene/sandbox/codecs/faiss/LibFaissC.java: ##

@@ -0,0 +1,457 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.codecs.faiss;
+
+import static java.lang.foreign.ValueLayout.ADDRESS;
+import static java.lang.foreign.ValueLayout.JAVA_FLOAT;
+import static java.lang.foreign.ValueLayout.JAVA_INT;
+import static java.lang.foreign.ValueLayout.JAVA_LONG;
+
+import java.io.IOException;
+import java.lang.foreign.Arena;
+import java.lang.foreign.FunctionDescriptor;
+import java.lang.foreign.Linker;
+import java.lang.foreign.MemoryLayout;
+import java.lang.foreign.MemorySegment;
+import java.lang.foreign.SymbolLookup;
+import java.lang.invoke.MethodHandle;
+import java.lang.invoke.MethodHandles;
+import java.lang.invoke.MethodType;
+import java.nio.ByteOrder;
+import java.nio.FloatBuffer;
+import java.nio.LongBuffer;
+import java.util.Locale;
+import org.apache.lucene.index.FloatVectorValues;
+import org.apache.lucene.index.KnnVectorValues;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.search.KnnCollector;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.Bits;
+import org.apache.lucene.util.FixedBitSet;
+import org.apache.lucene.util.hnsw.IntToIntFunction;
+
+public final class LibFaissC {
+  /*
+   * TODO: Requires some changes to Faiss
+   * - https://github.com/facebookresearch/faiss/pull/4158 (merged in main, to be released in v1.11.0)
+   * - https://github.com/facebookresearch/faiss/pull/4167 (merged in main, to be released in v1.11.0)
+   * - https://github.com/facebookresearch/faiss/pull/4180 (in progress)
+   */
+
+  public static final String LIBRARY_NAME = "faiss_c";
+  public static final String LIBRARY_VERSION = "1.10.0";
+
+  static {
+    try {
+      System.loadLibrary(LIBRARY_NAME);
+    } catch (UnsatisfiedLinkError e) {
+      throw new RuntimeException(
+          "Shared library not found, build the Faiss C_API from https://github.com/facebookresearch/faiss/blob/main/c_api/INSTALL.md "
+              + "and link it (along with all dependencies) to the library path "
+              + "(-Djava.library.path JVM argument or $LD_LIBRARY_PATH environment variable)",
+          e);
+    }
+    checkLibraryVersion();
+  }
+
+  private LibFaissC() {}
+
+  @SuppressWarnings("SameParameterValue")
+  private static MemorySegment getUpcallStub(
+      Arena arena, MethodHandle target, MemoryLayout resLayout, MemoryLayout... argLayouts) {
+    return Linker.nativeLinker()
+        .upcallStub(target, FunctionDescriptor.of(resLayout, argLayouts), arena);
+  }
+
+  private static MethodHandle getDowncallHandle(
+      String functionName, MemoryLayout resLayout, MemoryLayout... argLayouts) {
+    return Linker.nativeLinker()
+        .downcallHandle(
+            SymbolLookup.loaderLookup().find(functionName).orElseThrow(),
+            FunctionDescriptor.of(resLayout, argLayouts));
+  }
+
+  private static void checkLibraryVersion() {
+    MethodHandle getVersion = getDowncallHandle("faiss_get_version", ADDRESS);
+    String actualVersion = callAndGetString(getVersion);
+    if (LIBRARY_VERSION.equals(actualVersion) == false) {
+      throw new UnsupportedOperationException(
+          String.format(
+              Locale.ROOT,
+              "Expected Faiss library version %s, found %s",
+              LIBRARY_VERSION,
+              actualVersion));
+    }
+  }
+
+  private static final MethodHandle FREE_INDEX =
+      getDowncallHandle("faiss_Index_free", JAVA_INT, ADDRESS);
+
+  public static void freeIndex(MemorySegment indexPointer) {
+    callAndHandleError(FREE_INDEX, indexPointer);
+  }
+
+  private static final MethodHandle FREE_CUSTOM_IO_WRITER =
+      getDowncallHandle("faiss_CustomIOWriter_free", JAVA_INT, ADDRESS);
+
+  public static void freeCustomIOWriter(MemorySegment customIOWriterPointer) {
+    callAndHandleError(FREE_CUSTOM_IO_WRITER, customIOWriterPointer);
+  }
+
+  private static final MethodHandle FREE_CUST
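For context on the FFM pattern in the file above: a downcall handle is built by pairing a symbol lookup with a `FunctionDescriptor`, then invoked like any `MethodHandle`. Here is a minimal, self-contained sketch of that same pattern; it is an assumption for illustration only, calling the standard C `strlen` from the default lookup rather than a Faiss function.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

public class DowncallDemo {
  public static void main(String[] args) throws Throwable {
    Linker linker = Linker.nativeLinker();
    // Look up a symbol and describe its C signature, mirroring the
    // SymbolLookup + FunctionDescriptor pattern in getDowncallHandle above.
    // strlen has signature: size_t strlen(const char *s)
    MethodHandle strlen =
        linker.downcallHandle(
            linker.defaultLookup().find("strlen").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));
    try (Arena arena = Arena.ofConfined()) {
      // Allocate a NUL-terminated UTF-8 string in native memory
      MemorySegment str = arena.allocateFrom("hello");
      long len = (long) strlen.invokeExact(str);
      System.out.println(len); // prints 5
    }
  }
}
```

The PR uses `SymbolLookup.loaderLookup()` instead of `defaultLookup()`, which is what ties symbol resolution to the `faiss_c` library loaded earlier via `System.loadLibrary`.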
Re: [PR] Add a Faiss codec for KNN searches [lucene]
kaivalnp commented on code in PR #14178: URL: https://github.com/apache/lucene/pull/14178#discussion_r1952061362

## lucene/sandbox/src/java22/org/apache/lucene/sandbox/codecs/faiss/LibFaissC.java: ##

(same diff context as the preceding comment)
Re: [PR] Use Vector API to decode BKD docIds [lucene]
gf2121 commented on PR #14203: URL: https://github.com/apache/lucene/pull/14203#issuecomment-2652803442

> applied the 0xFF mask to scratch in the shift loop

This helps generate `vpand` in the assembly, but does not help performance much.

> Sorry for pushing

Not at all, it's interesting & exciting to play with how loops get auto-vectorized!

> if we could get auto-vectorization to do the right thing, then this would automatically benefit all users, not only those who enable the vector module.

+1

I tried to minimize the code to see when the loop gets vectorized. I wrote a simple `ShiftMaskBenchmark` (added to this patch temporarily to show the code). It tells me that the JIT cannot vectorize the loop when the offset is not predictable. This might explain why the remainder decode / bpv16 decoding cannot be vectorized.

asm: [perf_asm.log](https://github.com/user-attachments/files/18763358/perf_asm.log)

```
$ java -version
openjdk version "23.0.2" 2025-01-21
OpenJDK Runtime Environment (build 23.0.2+7-58)
OpenJDK 64-Bit Server VM (build 23.0.2+7-58, mixed mode, sharing)
```

Mac M2:

```
Benchmark                      Mode  Cnt     Score     Error   Units
ShiftMaskBenchmark.fixOffset  thrpt    5  9131.716 ±  49.914  ops/ms
ShiftMaskBenchmark.varOffset  thrpt    5  3019.323 ± 120.261  ops/ms
```

Linux AVX512:

```
Benchmark                          Mode  Cnt     Score    Error   Units
ShiftMaskBenchmark.fixOffset      thrpt    5  5021.907 ± 13.400  ops/ms
ShiftMaskBenchmark.fixOffset:asm  thrpt  NaN       ---
ShiftMaskBenchmark.varOffset      thrpt    5  1115.164 ±  0.958  ops/ms
ShiftMaskBenchmark.varOffset:asm  thrpt  NaN       ---
```

`InnerLoop` seems like a good option if the conclusion is true. I'll try to see if I can improve it a bit. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
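Roughly, a benchmark of that shape could look like the following sketch. The class, field names, and array sizes here are assumptions for illustration, not the actual patch: the two methods differ only in whether the shift amount is a compile-time constant, which is what the fixOffset/varOffset scores above contrast.

```java
import java.util.Random;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class ShiftMaskSketch {
  private final int[] src = new int[1024];
  private final int[] dst = new int[1024];
  private int offset; // loaded from a field: not provably constant to the JIT

  @Setup(Level.Trial)
  public void setup() {
    Random r = new Random(42);
    for (int i = 0; i < src.length; i++) {
      src[i] = r.nextInt();
    }
    offset = 8; // same value as the constant case, but opaque to the compiler
  }

  @Benchmark
  public int[] fixOffset() {
    for (int i = 0; i < src.length; i++) {
      dst[i] = (src[i] >>> 8) & 0xFF; // constant shift: tends to auto-vectorize
    }
    return dst; // return to defeat dead-code elimination
  }

  @Benchmark
  public int[] varOffset() {
    for (int i = 0; i < src.length; i++) {
      dst[i] = (src[i] >>> offset) & 0xFF; // variable shift: often left scalar
    }
    return dst;
  }
}
```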
Re: [PR] Use Vector API to decode BKD docIds [lucene]
jpountz commented on PR #14203: URL: https://github.com/apache/lucene/pull/14203#issuecomment-2652267868

> current bpv=24 gets vectorized on the shift loop, but not for the remainder loop.

This is an interesting observation. I wonder if a small refactoring could help it get auto-vectorized? E.g. what if we applied the `0xFF` mask to `scratch` in the shift loop rather than the remainder loop? Or if we split the remainder loop into 3 loops, one for each 8 bits that get contributed to the value?

Sorry for pushing, but if we could get auto-vectorization to do the right thing, then this would automatically benefit all users, not only those who enable the vector module. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
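For illustration, a sketch of that split-loop suggestion, under assumed names and a big-endian bpv=24 byte layout; this is not Lucene's actual decoder. Each pass is a simple constant shift-and-mask over contiguous values, the loop shape auto-vectorizers handle best.

```java
// Hypothetical sketch: scratch, base, docIds, and count are illustrative names.
// Each 24-bit value occupies 3 consecutive bytes; three passes each contribute
// 8 bits instead of one remainder loop assembling all three bytes per value.
static void decodeRemainder24(byte[] scratch, int base, int[] docIds, int count) {
  for (int i = 0; i < count; i++) {
    docIds[i] = (scratch[base + 3 * i] & 0xFF) << 16;      // high byte
  }
  for (int i = 0; i < count; i++) {
    docIds[i] |= (scratch[base + 3 * i + 1] & 0xFF) << 8;  // middle byte
  }
  for (int i = 0; i < count; i++) {
    docIds[i] |= scratch[base + 3 * i + 2] & 0xFF;         // low byte
  }
}
```

Whether this shape actually vectorizes would still depend on the JIT handling the strided (3 * i) loads; that is exactly the kind of question the benchmark above is probing.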
Re: [PR] Add histogram facet capabilities. [lucene]
gsmiller commented on PR #14204: URL: https://github.com/apache/lucene/pull/14204#issuecomment-2651798767 This looks good to me @jpountz. I think it makes sense to put this in sandbox, but I'd personally also be fine with leaving it where you initially had it. (I think this also highlights that it would be nice to get the sandbox faceting module out of sandbox and reconciled with the traditional faceting module sooner rather than later.) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[I] TestForTooMuchCloning.test fails [lucene]
benwtrent opened a new issue, #14220: URL: https://github.com/apache/lucene/issues/14220

### Description

I have seen this fail in periodic CI builds and on some PR builds.

```
TestForTooMuchCloning > test FAILED
java.lang.AssertionError: too many calls to IndexInput.clone during merging: 519
    at __randomizedtesting.SeedInfo.seed([274680B90E738EFA:AF12BF63A08FE302]:0)
    at org.junit.Assert.fail(Assert.java:89)
    at org.junit.Assert.assertTrue(Assert.java:42)
    at org.apache.lucene.index.TestForTooMuchCloning.test(TestForTooMuchCloning.java:63)
    at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
    at java.base/java.lang.reflect.Method.invoke(Method.java:580)
```

Or:

```
TestForTooMuchCloning > test FAILED
java.lang.AssertionError: too many calls to IndexInput.clone during merging: 503
    at __randomizedtesting.SeedInfo.seed([FB5D79942E80A9F0:7309464E807CC408]:0)
    at org.junit.Assert.fail(Assert.java:89)
    at org.junit.Assert.assertTrue(Assert.java:42)
    at org.apache.lucene.index.TestForTooMuchCloning.test(TestForTooMuchCloning.java:63)
```

### Gradle command to reproduce

```
./gradlew test --tests TestForTooMuchCloning.test -Dtests.seed=FB5D79942E80A9F0 -Dtests.locale=sl-Latn-SI -Dtests.timezone=America/Mexico_City -Dtests.asserts=true -Dtests.file.encoding=UTF-8
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Correct hashCode of SynonymQuery [lucene]
mayya-sharipova merged PR #14217: URL: https://github.com/apache/lucene/pull/14217 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Make HNSW merges cheaper on heap or at least expose heap usage estimate [lucene]
benwtrent commented on issue #14208: URL: https://github.com/apache/lucene/issues/14208#issuecomment-2651681430 Ah, ok :) no offense meant. It just struck me that, while they are ideas that might be worth exploring, they were a little too general (what I have seen from LLMs in the past). If we could merge portions of the graph together at a time (e.g. bootstrapping from clusters), we could reduce memory and improve performance. It's just tricky to get right and nobody has tackled it yet. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Make HNSW merges cheaper on heap or at least expose heap usage estimate [lucene]
Vikasht34 commented on issue #14208: URL: https://github.com/apache/lucene/issues/14208#issuecomment-2651696435 @benwtrent Should we create a separate issue? I can take a stab at implementing it with this paper: https://www.arxiv.org/pdf/1908.00814v5 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Clean up public non-final statics [lucene]
msfroh opened a new pull request, #14221: URL: https://github.com/apache/lucene/pull/14221 ### Description Following up on https://github.com/apache/lucene/issues/14151 and https://github.com/apache/lucene/issues/14152, I decided to grep for other `public static` non-`final` variables in the Lucene codebase. This change makes `static`s `final` where it is both possible and appropriate (that is, where the variables really are constants). I made a couple of them `private` instead, since they are modified, but only within their own class. I also updated some old-school code where IntelliJ flagged opportunities to use newer Java idioms. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
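An illustration of the kind of cleanup involved; the name here is hypothetical, not a line from the PR.

```java
// Hypothetical example, not taken from the PR.
// Before: any caller can reassign this "constant" at runtime,
// making it mutable global state.
public static int DEFAULT_BUFFER_SIZE = 1024;

// After: the compiler enforces that the value never changes,
// and reads of it can be constant-folded by the JIT.
public static final int DEFAULT_BUFFER_SIZE = 1024;
```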
Re: [PR] Bugfix/fix hnsw search termination check [lucene]
benwtrent merged PR #14215: URL: https://github.com/apache/lucene/pull/14215 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Make HNSW merges cheaper on heap or at least expose heap usage estimate [lucene]
benwtrent commented on issue #14208: URL: https://github.com/apache/lucene/issues/14208#issuecomment-2651707316 @Vikasht34 there is already a general "make hnsw merges faster" issue, so I think you can just refer to it. I look forward to the results! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] TestForTooMuchCloning.test fails [lucene]
benwtrent commented on issue #14220: URL: https://github.com/apache/lucene/issues/14220#issuecomment-2651975052 Took a bit to find the first commit where this fails; it sometimes succeeds, so git bisect was tricky. The first commit I could find is this one: https://github.com/apache/lucene/commit/d2c69c1472c7af87d7e619dfcabe207a4a614c20 It fails pretty reliably with this seed on that commit. On the commit before it, I ran 1000s of tests and it never failed with this seed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Consistent KNN query results with multiple leafs [lucene]
benwtrent commented on PR #14191: URL: https://github.com/apache/lucene/pull/14191#issuecomment-2652285114

OK, I ran a slightly modified version of this: https://github.com/apache/lucene/compare/main...benwtrent:lucene:feature/consistent-sharing-knn?expand=1

I indexed 8M docs with Elasticsearch's default merging. This resulted in multiple different tiers and many segments. The results show it's better than no sharing at all (2x), but not nearly as good (from a visited standpoint) as the current sharing model (between all segments).

## No sharing:

```
recall  latency(ms)  nDoc  topK  fanout  visited  num segments  selectivity
 1.000      213.400   800   100     100   412280           128        1.000
```

## Current Sharing:

### Single threaded:

```
recall  latency(ms)  nDoc  topK  fanout  visited  num segments  selectivity
 0.949       59.900   800   100     100    87299           128        1.000
```

### 8 Threads (multiple runs)

```
recall  latency(ms)  nDoc  topK  fanout  visited  num segments  selectivity
 0.967       12.000   800   100     100    84477           128        1.000
 0.967       10.100   800   100     100    84568           128        1.000
 0.949       12.300   800   100     100    85034           128        1.000
 0.949       13.300   800   100     100    84267           128        1.000
 0.967       11.700   800   100     100    86035           128        1.000
 0.949       12.000   800   100     100    84993           128        1.000
 0.949       12.700   800   100     100    85139           128        1.000
 0.967       10.000   800   100     100    85590           128        1.000
 0.949       10.200   800   100     100    85106           128        1.000
 0.967       12.600   800   100     100    85817           128        1.000
```

## New Sharing:

### Single thread:

```
recall  latency(ms)  nDoc  topK  fanout  visited  num segments  selectivity
 0.957      119.500   800   100     100   193402           128        1.000
```

### 8 Threads (multiple runs)

```
recall  latency(ms)  nDoc  topK  fanout  visited  num segments  selectivity
 0.957       16.400   800   100     100   193402           128        1.000
 0.957       14.600   800   100     100   193402           128        1.000
 0.957       16.800   800   100     100   193402           128        1.000
 0.957       16.400   800   100     100   193402           128        1.000
 0.957       16.700   800   100     100   193402           128        1.000
 0.957       16.000   800   100     100   193402           128        1.000
 0.957       16.200   800   100     100   193402           128        1.000
 0.957       14.400   800   100     100   193402           128        1.000
 0.957       19.500   800   100     100   193402           128        1.000
 0.957       18.600   800   100     100   193402           128        1.000
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[I] Refactor QueryCache to improve concurrency and performance [lucene]
sgup432 opened a new issue, #14222: URL: https://github.com/apache/lucene/issues/14222

### Description

Given the significant role of LRUQueryCache in Lucene, I see opportunities to enhance its performance. Although there have been many discussions on this topic, like [here](https://github.com/apache/lucene/issues/10081), they haven't led to anywhere.

Currently, QueryCache uses a `Map<CacheKey, LeafCache>`, where:
- **CacheKey** is tied to a specific segment.
- **LeafCache** is per-segment and maintains a `Map<Query, CacheAndCount>`, where CacheAndCount is essentially a docID set.

### Current Issue:

We use a global read/write lock to put items into the cache. However, since CacheKey is our primary key, this limits concurrency. For example, if there are only 10 segments but 10,000 unique queries, we end up with just 10 unique cache keys. This means a large number of requests compete for the same write lock, significantly degrading performance: they either wait or skip the cache entirely (when the lock cannot be obtained).

### One simple Solution:

- **Introduce a Composite Key**
  - Instead of CacheKey alone, use (CacheKey, Query) as the key.
  - The new structure would be `Map<(CacheKey, Query), CacheAndCount>`.
- **Partition the Cache for Better Concurrency**
  - Create multiple LRU partitions (e.g., 8, 16, or 64, configurable).
  - Each partition has its own read/write lock to reduce contention.
  - Assign incoming requests to a specific partition using `hash(CacheKey, Query) % partition_count` (as an example); a rough sketch follows at the end of this message.
- I also see we use uniqueQueries/mostRecentlyUsedQueries [here](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java#L94-L98) to maintain an LRU list for the queries. I think we can remove it and instead maintain a separate LRU list per partition, using the composite key.
- For eviction & cleanup, we can collect stale `CacheKey` entries in a `Set` when a segment is closed due to a merge. A background thread later iterates over cache entries and clears them, similar to how the RequestCache in OpenSearch works.

Let me know what you think. I think it might work pretty well (and is simple enough), but I am open to other ideas.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
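A rough, standalone sketch of the partitioned-LRU idea described above. All names here are hypothetical, and the real LRUQueryCache also tracks memory usage and cache statistics, which this omits; it only shows how hash-based partitioning with per-partition locks reduces contention.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch, not Lucene's actual API.
final class PartitionedLruCache<K, V> {
  private final int partitionCount;
  private final Map<K, V>[] partitions;
  private final ReentrantReadWriteLock[] locks;

  @SuppressWarnings("unchecked")
  PartitionedLruCache(int partitionCount, int maxEntriesPerPartition) {
    this.partitionCount = partitionCount;
    partitions = new Map[partitionCount];
    locks = new ReentrantReadWriteLock[partitionCount];
    for (int i = 0; i < partitionCount; i++) {
      // Access-ordered LinkedHashMap gives simple per-partition LRU eviction.
      partitions[i] =
          new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
              return size() > maxEntriesPerPartition;
            }
          };
      locks[i] = new ReentrantReadWriteLock();
    }
  }

  // e.g. K = a composite of (CacheKey, Query); its hashCode picks the partition
  private int partitionFor(K key) {
    return Math.floorMod(key.hashCode(), partitionCount);
  }

  V get(K key) {
    int p = partitionFor(key);
    // Write lock even on reads: an access-ordered map mutates on get().
    locks[p].writeLock().lock();
    try {
      return partitions[p].get(key);
    } finally {
      locks[p].writeLock().unlock();
    }
  }

  void put(K key, V value) {
    int p = partitionFor(key);
    locks[p].writeLock().lock();
    try {
      partitions[p].put(key, value);
    } finally {
      locks[p].writeLock().unlock();
    }
  }
}
```

Two threads caching different (CacheKey, Query) pairs would now contend only when their keys hash to the same partition, rather than whenever they touch the same segment.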