Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]
vigyasharma commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2439876776

Thanks @benwtrent. I've been working on getting a multi-vector benchmark running to wire this end to end. Found some pesky bugs and oversights.

I'm planning to split this feature into multiple smaller PRs. This PR was mainly to get input on the approach; it's too big to test and review. I'll share a plan for the split PRs soon.

Re: the multi-vector benchmark for the passage search use-case, I've been stuck on a bug where I run into an `EOFException` when reading the last multi-vector document through `DenseOffHeapMultiVectorValues`. I could definitely use some help here. If you plan to take a look, you can use the code in this PR (I'll push my fixes) and the multi-vector benchmark code from [here](https://github.com/vigyasharma/luceneutil/tree/multivec).

```java
Exception in thread "main" java.lang.RuntimeException: java.io.EOFException: read past EOF: MemorySegmentIndexInput(path="/Users/vigyas/forks/bench/util/knnIndices/cohere-wikipedia-docs-768d.vec-32-50-multiVector.index/_0_Lucene99HnswMultiVectorsFormat_0.vecmv") [slice=multi-vector-data]
    at knn.KnnGraphTester$ComputeBaselineNNFloatTask.call(KnnGraphTester.java:1115)
    at knn.KnnGraphTester.computeNN(KnnGraphTester.java:967)
    at knn.KnnGraphTester.getNN(KnnGraphTester.java:812)
    at knn.KnnGraphTester.run(KnnGraphTester.java:438)
    at knn.KnnGraphTester.runWithCleanUp(KnnGraphTester.java:177)
    at knn.KnnGraphTester.main(KnnGraphTester.java:172)
Caused by: java.io.EOFException: read past EOF: MemorySegmentIndexInput(path="/Users/vigyas/forks/bench/util/knnIndices/cohere-wikipedia-docs-768d.vec-32-50-multiVector.index/_0_Lucene99HnswMultiVectorsFormat_0.vecmv") [slice=multi-vector-data]
    at org.apache.lucene.store.MemorySegmentIndexInput.readByte(MemorySegmentIndexInput.java:146)
    at org.apache.lucene.store.DataInput.readInt(DataInput.java:95)
    at org.apache.lucene.store.MemorySegmentIndexInput.readInt(MemorySegmentIndexInput.java:261)
    at org.apache.lucene.store.DataInput.readFloats(DataInput.java:202)
    at org.apache.lucene.store.MemorySegmentIndexInput.readFloats(MemorySegmentIndexInput.java:231)
    at org.apache.lucene.codecs.lucene99.OffHeapFloatMultiVectorValues.vectorValue(OffHeapFloatMultiVectorValues.java:111)
    at org.apache.lucene.codecs.lucene99.OffHeapFloatMultiVectorValues.vectorValue(OffHeapFloatMultiVectorValues.java:130)
    at org.apache.lucene.codecs.hnsw.DefaultFlatMultiVectorScorer$FloatMultiVectorScorer.score(DefaultFlatMultiVectorScorer.java:185)
    at org.apache.lucene.codecs.lucene99.OffHeapFloatMultiVectorValues$DenseOffHeapMultiVectorValues$1.score(OffHeapFloatMultiVectorValues.java:248)
    at org.apache.lucene.search.AbstractKnnVectorQuery.exactSearch(AbstractKnnVectorQuery.java:220)
    at knn.KnnFloatVectorBenchmarkQuery.exactSearch(KnnFloatVectorBenchmarkQuery.java:33)
    at knn.KnnFloatVectorBenchmarkQuery.runExactSearch(KnnFloatVectorBenchmarkQuery.java:50)
    at knn.KnnGraphTester$ComputeBaselineNNFloatTask.call(KnnGraphTester.java:)
    ... 5 more
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
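One way to narrow down a read-past-EOF like the one in the trace above is to check, before reading any floats, that each document's byte range fits inside the data slice. The sketch below is hypothetical: the layout it assumes (a per-document start offset and vector count, 4-byte floats, a fixed dimension) and every name in it are illustrative assumptions, not the on-disk format or API of this PR.

```java
public final class MultiVectorBoundsCheck {
  /**
   * Returns the ordinal of the first document whose vectors would extend past the
   * end of the data slice, or -1 if every document fits.
   *
   * Assumed (hypothetical) layout: document `ord` starts at startOffsets[ord] and
   * holds vectorCounts[ord] vectors of `dimension` floats each.
   */
  static int firstOutOfBounds(
      long[] startOffsets, int[] vectorCounts, int dimension, long sliceLength) {
    for (int ord = 0; ord < startOffsets.length; ord++) {
      // Widen to long before multiplying so large documents cannot overflow int.
      long bytes = (long) vectorCounts[ord] * dimension * Float.BYTES;
      if (startOffsets[ord] < 0 || startOffsets[ord] + bytes > sliceLength) {
        return ord;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    long[] offsets = {0, 24};
    int[] counts = {3, 2};
    // With dimension 2, the two documents need 24 and 16 bytes respectively.
    System.out.println(firstOutOfBounds(offsets, counts, 2, 40));
    System.out.println(firstOutOfBounds(offsets, counts, 2, 36));
  }
}
```

If a check of this kind flags only the last ordinal, that would point at the written offsets or the slice length rather than at the scorer, but that is a guess about where to look, not a diagnosis.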
Re: [PR] [SOLR-11191] SolrIndexSplitter: Init support for routing docs by _root_ when available [lucene-solr]
zkendall commented on PR #2686: URL: https://github.com/apache/lucene-solr/pull/2686#issuecomment-2439871664 Closing in favor of PR to solr repo: https://github.com/apache/solr/pull/2799
Re: [PR] [SOLR-11191] SolrIndexSplitter: Init support for routing docs by _root_ when available [lucene-solr]
zkendall closed pull request #2686: [SOLR-11191] SolrIndexSplitter: Init support for routing docs by _root_ when available URL: https://github.com/apache/lucene-solr/pull/2686
Re: [PR] Use multi-select instead of a full sort for DynamicRange creation [lucene]
stefanvodita commented on code in PR #13914: URL: https://github.com/apache/lucene/pull/13914#discussion_r1817832373

## lucene/facet/src/java/org/apache/lucene/facet/range/DynamicRangeUtil.java: ##
@@ -202,66 +208,83 @@ public SegmentOutput(int hitsLength) {
    * is used to compute the equi-weight per bin.
    */
   public static List computeDynamicNumericRanges(
-      long[] values, long[] weights, int len, long totalWeight, int topN) {
+      long[] values, long[] weights, int len, long totalValue, long totalWeight, int topN) {
     assert values.length == weights.length && len <= values.length && len >= 0;
     assert topN >= 0;
     List dynamicRangeResult = new ArrayList<>();
     if (len == 0 || topN == 0) {
       return dynamicRangeResult;
     }
-    new InPlaceMergeSorter() {
-      @Override
-      protected int compare(int index1, int index2) {
-        int cmp = Long.compare(values[index1], values[index2]);
-        if (cmp == 0) {
-          // If the values are equal, sort based on the weights.
-          // Any weight order is correct as long as it's deterministic.
-          return Long.compare(weights[index1], weights[index2]);
-        }
-        return cmp;
-      }
+    double rangeWeightTarget = (double) totalWeight / topN;
+    double[] kWeights = new double[topN];
+    for (int i = 0; i < topN; i++) {
+      kWeights[i] = (i == 0 ? 0 : kWeights[i - 1]) + rangeWeightTarget;

Review Comment:
I thought maybe you wanted to avoid the multiplications 😄 That would be fair; my guess is that the second one is faster because we're only doing sums and referencing values in the array that are cached.
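To make the trade-off in this review comment concrete, here is a minimal sketch of two ways to fill the cumulative weight targets. The running-sum form mirrors the loop in the diff. The multiplication form is an assumption about the alternative being discussed, since the thread excerpt does not show it. The class and method names are hypothetical.

```java
import java.util.Arrays;

public final class CumulativeTargets {
  /** Running-sum form, as in the diff: each entry adds the per-range target to the previous one. */
  static double[] bySum(long totalWeight, int topN) {
    double target = (double) totalWeight / topN;
    double[] k = new double[topN];
    for (int i = 0; i < topN; i++) {
      k[i] = (i == 0 ? 0 : k[i - 1]) + target;
    }
    return k;
  }

  /** Multiplication form (assumed alternative): each entry is computed independently. */
  static double[] byProduct(long totalWeight, int topN) {
    double target = (double) totalWeight / topN;
    double[] k = new double[topN];
    for (int i = 0; i < topN; i++) {
      k[i] = (i + 1) * target;
    }
    return k;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(bySum(10, 4)));
    System.out.println(Arrays.toString(byProduct(10, 4)));
  }
}
```

Apart from speed, the two forms can differ in the last few bits: the running sum accumulates one rounding step per entry, while the product rounds once per entry. Whether that matters depends on how the targets are compared against integer cumulative weights downstream.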
Re: [PR] Use multi-select instead of a full sort for DynamicRange creation [lucene]
stefanvodita commented on code in PR #13914: URL: https://github.com/apache/lucene/pull/13914#discussion_r1817835241

## lucene/core/src/java/org/apache/lucene/util/WeightedSelector.java: ##
@@ -0,0 +1,407 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util;
+
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.SplittableRandom;
+
+/**
+ * Adaptive selection algorithm based on the introspective quick select algorithm. The quick select
+ * algorithm uses an interpolation variant of Tukey's ninther median-of-medians for pivot, and
+ * Bentley-McIlroy 3-way partitioning. For the introspective protection, it shuffles the sub-range
+ * if the max recursive depth is exceeded.
+ *
+ * This selection algorithm is fast on most data shapes, especially on nearly sorted data, or
+ * when k is close to the boundaries. It runs in linear time on average.
+ *
+ * @lucene.internal
+ */
+public abstract class WeightedSelector {
+
+  // This selector is used repeatedly by the radix selector for sub-ranges of less than
+  // 100 entries. This means this selector is also optimized to be fast on small ranges.
+  // It uses the variant of medians-of-medians and 3-way partitioning, and finishes the
+  // last tiny range (3 entries or less) with a very specialized sort.
+
+  private SplittableRandom random;
+
+  protected abstract long getWeight(int i);
+
+  protected abstract long getValue(int i);
+
+  public final WeightRangeInfo[] select(
+      int from,
+      int to,
+      long rangeTotalValue,
+      long beforeTotalValue,
+      long rangeWeight,
+      long beforeWeight,
+      double[] kWeights) {

Review Comment:
Does it make sense to replace `k` with `quantile` maybe?
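For readers wondering what the `kWeights` targets are selecting (and why `quantile` might be a fitting name), the following is a sort-based reference sketch. It is not the `WeightedSelector` algorithm from the PR, which avoids a full sort. It only illustrates the idea of picking, for each cumulative weight target, the value at which the running weight first reaches that target. All names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public final class WeightedQuantileBaseline {
  /**
   * For each target in quantileWeights (assumed ascending), returns the value at which
   * the cumulative weight, taken in ascending value order, first reaches that target.
   */
  static List<Long> boundaries(long[] values, long[] weights, double[] quantileWeights) {
    Integer[] order = new Integer[values.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    // Sort indices by value, breaking ties by weight so the order is deterministic.
    Arrays.sort(
        order,
        (a, b) -> {
          int cmp = Long.compare(values[a], values[b]);
          return cmp != 0 ? cmp : Long.compare(weights[a], weights[b]);
        });

    List<Long> result = new ArrayList<>();
    long cumulative = 0;
    int q = 0;
    for (int idx : order) {
      cumulative += weights[idx];
      // A single heavy entry may satisfy several targets at once.
      while (q < quantileWeights.length && cumulative >= quantileWeights[q]) {
        result.add(values[idx]);
        q++;
      }
    }
    return result;
  }

  public static void main(String[] args) {
    long[] values = {5, 1, 3, 2};
    long[] weights = {1, 1, 1, 1};
    System.out.println(boundaries(values, weights, new double[] {2.0, 4.0}));
  }
}
```

A baseline like this can also serve as an oracle in randomized tests for a selection-based implementation.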
Re: [PR] Remove LeafSimScorer abstraction. [lucene]
jpountz merged PR #13957: URL: https://github.com/apache/lucene/pull/13957
Re: [I] Absolutely horrible Lucene performance with JDK 23 (Lucene 9.11.1 and 10.0.0) [lucene]
jpountz commented on issue #13959: URL: https://github.com/apache/lucene/issues/13959#issuecomment-2439549339 Is there also no regression if you use the default garbage collector? If so, this looks like a regression with Shenandoah.
Re: [I] Absolutely horrible Lucene performance with JDK 23 (Lucene 9.11.1 and 10.0.0) [lucene]
derreisende77 commented on issue #13959: URL: https://github.com/apache/lucene/issues/13959#issuecomment-2439550280 I have no problem with the default GC either. I downloaded a Windows Amazon Corretto 23 nightly build from today and I don't have a problem with Shenandoah anymore. So I guess it is a Shenandoah problem in JDK 23 and 23.0.1 that will be fixed in a later 23 release.
Re: [I] Absolutely horrible Lucene performance with JDK 23 (Lucene 9.11.1 and 10.0.0) [lucene]
derreisende77 commented on issue #13959: URL: https://github.com/apache/lucene/issues/13959#issuecomment-2439422246

@benwtrent I ran several tests on macOS with JDK 23 and 23.0.1 from Liberica and Azul. I always ran into the performance problem. I switched from Shenandoah GC to ZGC with

```
-XX:+UseZGC -XX:+ZGenerational -XX:ZUncommitDelay=5 -XX:+ZUncommit -XX:SoftMaxHeapSize=4g
```

and so far I have been unable to trigger the performance problem. Lucene performs as well as before with JDK 23.
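When comparing collectors as in this thread, it can help to confirm which one a given JVM process is actually running, since flags are easy to drop in launcher scripts. A minimal sketch using the standard `java.lang.management` API follows. The class name is hypothetical, and the exact collector names printed vary by JVM vendor and version.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public final class ActiveGc {
  /** Returns the names of the garbage collector beans registered in this JVM. */
  static List<String> collectorNames() {
    List<String> names = new ArrayList<>();
    for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
      names.add(bean.getName());
    }
    return names;
  }

  public static void main(String[] args) {
    // Print the runtime version alongside the collectors so runs are easy to compare.
    System.out.println("Java " + Runtime.version() + " collectors: " + collectorNames());
  }
}
```

Running this once with the Shenandoah flags and once with the ZGC flags above should show different collector names, which makes it easier to attribute a timing difference to the collector rather than to something else in the setup.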
Re: [I] Absolutely horrible Lucene performance with JDK 23 (Lucene 9.11.1 and 10.0.0) [lucene]
derreisende77 closed issue #13959: Absolutely horrible Lucene performance with JDK 23 (Lucene 9.11.1 and 10.0.0) URL: https://github.com/apache/lucene/issues/13959