Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]

2024-10-26 Thread via GitHub


vigyasharma commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2439876776

   Thanks @benwtrent. I've been working on getting a multi-vector benchmark 
running to wire this end to end, and found some pesky bugs and oversights. I'm 
planning to split this feature into multiple smaller PRs. This PR was mainly to 
get input on the approach; it's too big to test and review as is. I'll share a 
plan for the split PRs soon.
   
   Re: the multi-vector benchmark for the passage search use case, I've been 
stuck on a bug where I run into an `EOFException` when reading the last 
multi-vector document through `DenseOffHeapMultiVectorValues`. I could 
definitely use some help here. If you plan to take a look, you can use the code 
in this PR (I'll push my fixes) and the multi-vector benchmark code from 
[here](https://github.com/vigyasharma/luceneutil/tree/multivec).
   
   ```java
   Exception in thread "main" java.lang.RuntimeException: java.io.EOFException: read past EOF: MemorySegmentIndexInput(path="/Users/vigyas/forks/bench/util/knnIndices/cohere-wikipedia-docs-768d.vec-32-50-multiVector.index/_0_Lucene99HnswMultiVectorsFormat_0.vecmv") [slice=multi-vector-data]
   at knn.KnnGraphTester$ComputeBaselineNNFloatTask.call(KnnGraphTester.java:1115)
   at knn.KnnGraphTester.computeNN(KnnGraphTester.java:967)
   at knn.KnnGraphTester.getNN(KnnGraphTester.java:812)
   at knn.KnnGraphTester.run(KnnGraphTester.java:438)
   at knn.KnnGraphTester.runWithCleanUp(KnnGraphTester.java:177)
   at knn.KnnGraphTester.main(KnnGraphTester.java:172)
   Caused by: java.io.EOFException: read past EOF: MemorySegmentIndexInput(path="/Users/vigyas/forks/bench/util/knnIndices/cohere-wikipedia-docs-768d.vec-32-50-multiVector.index/_0_Lucene99HnswMultiVectorsFormat_0.vecmv") [slice=multi-vector-data]
   at org.apache.lucene.store.MemorySegmentIndexInput.readByte(MemorySegmentIndexInput.java:146)
   at org.apache.lucene.store.DataInput.readInt(DataInput.java:95)
   at org.apache.lucene.store.MemorySegmentIndexInput.readInt(MemorySegmentIndexInput.java:261)
   at org.apache.lucene.store.DataInput.readFloats(DataInput.java:202)
   at org.apache.lucene.store.MemorySegmentIndexInput.readFloats(MemorySegmentIndexInput.java:231)
   at org.apache.lucene.codecs.lucene99.OffHeapFloatMultiVectorValues.vectorValue(OffHeapFloatMultiVectorValues.java:111)
   at org.apache.lucene.codecs.lucene99.OffHeapFloatMultiVectorValues.vectorValue(OffHeapFloatMultiVectorValues.java:130)
   at org.apache.lucene.codecs.hnsw.DefaultFlatMultiVectorScorer$FloatMultiVectorScorer.score(DefaultFlatMultiVectorScorer.java:185)
   at org.apache.lucene.codecs.lucene99.OffHeapFloatMultiVectorValues$DenseOffHeapMultiVectorValues$1.score(OffHeapFloatMultiVectorValues.java:248)
   at org.apache.lucene.search.AbstractKnnVectorQuery.exactSearch(AbstractKnnVectorQuery.java:220)
   at knn.KnnFloatVectorBenchmarkQuery.exactSearch(KnnFloatVectorBenchmarkQuery.java:33)
   at knn.KnnFloatVectorBenchmarkQuery.runExactSearch(KnnFloatVectorBenchmarkQuery.java:50)
   at knn.KnnGraphTester$ComputeBaselineNNFloatTask.call(KnnGraphTester.java:)
   ... 5 more
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] [SOLR-11191] SolrIndexSplitter: Init support for routing docs by _root_ when available [lucene-solr]

2024-10-26 Thread via GitHub


zkendall commented on PR #2686:
URL: https://github.com/apache/lucene-solr/pull/2686#issuecomment-2439871664

   Closing in favor of PR to solr repo: https://github.com/apache/solr/pull/2799





Re: [PR] [SOLR-11191] SolrIndexSplitter: Init support for routing docs by _root_ when available [lucene-solr]

2024-10-26 Thread via GitHub


zkendall closed pull request #2686: [SOLR-11191] SolrIndexSplitter: Init 
support for routing docs by _root_ when available
URL: https://github.com/apache/lucene-solr/pull/2686





Re: [PR] Use multi-select instead of a full sort for DynamicRange creation [lucene]

2024-10-26 Thread via GitHub


stefanvodita commented on code in PR #13914:
URL: https://github.com/apache/lucene/pull/13914#discussion_r1817832373


##
lucene/facet/src/java/org/apache/lucene/facet/range/DynamicRangeUtil.java:
##
@@ -202,66 +208,83 @@ public SegmentOutput(int hitsLength) {
* is used to compute the equi-weight per bin.
*/
   public static List computeDynamicNumericRanges(
-  long[] values, long[] weights, int len, long totalWeight, int topN) {
+  long[] values, long[] weights, int len, long totalValue, long totalWeight, int topN) {
 assert values.length == weights.length && len <= values.length && len >= 0;
 assert topN >= 0;
 List dynamicRangeResult = new ArrayList<>();
 if (len == 0 || topN == 0) {
   return dynamicRangeResult;
 }
 
-new InPlaceMergeSorter() {
-  @Override
-  protected int compare(int index1, int index2) {
-int cmp = Long.compare(values[index1], values[index2]);
-if (cmp == 0) {
-  // If the values are equal, sort based on the weights.
-  // Any weight order is correct as long as it's deterministic.
-  return Long.compare(weights[index1], weights[index2]);
-}
-return cmp;
-  }
+double rangeWeightTarget = (double) totalWeight / topN;
+double[] kWeights = new double[topN];
+for (int i = 0; i < topN; i++) {
+  kWeights[i] = (i == 0 ? 0 : kWeights[i - 1]) + rangeWeightTarget;

Review Comment:
   I thought maybe you wanted to avoid the multiplications 😄 
   Which would be fair, my guess is the second one is faster because we're only 
doing sums and referencing values in the array that are cached.
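To make the comparison concrete, here is a minimal sketch of two ways to build the cumulative weight targets. `bySum` mirrors the running-sum loop from the diff above. `byProduct` is only an assumption about what the multiplication-based alternative might look like, since that version is not shown in this thread. The class and method names are hypothetical and not part of the PR.

```java
import java.util.Arrays;

/** Illustrative only: names are hypothetical and not part of the PR. */
final class CumulativeTargets {

  /** Running-sum form, mirroring the loop in the diff: each slot adds the target to the previous slot. */
  static double[] bySum(long totalWeight, int topN) {
    double rangeWeightTarget = (double) totalWeight / topN;
    double[] kWeights = new double[topN];
    for (int i = 0; i < topN; i++) {
      kWeights[i] = (i == 0 ? 0 : kWeights[i - 1]) + rangeWeightTarget;
    }
    return kWeights;
  }

  /** Assumed alternative: one multiplication per slot, with no dependency on the previous slot. */
  static double[] byProduct(long totalWeight, int topN) {
    double rangeWeightTarget = (double) totalWeight / topN;
    double[] kWeights = new double[topN];
    for (int i = 0; i < topN; i++) {
      kWeights[i] = (i + 1) * rangeWeightTarget;
    }
    return kWeights;
  }

  public static void main(String[] args) {
    // With totalWeight = 100 and topN = 4 the per-range target is exactly 25.0,
    // so both forms produce the same array.
    System.out.println(Arrays.toString(bySum(100, 4)));
    System.out.println(Arrays.toString(byProduct(100, 4)));
  }
}
```

When the per-range target is not exactly representable as a `double`, the two forms can differ in the last bits, because the running sum accumulates rounding error across slots while the product rounds once per slot. Which one is faster in practice would need a benchmark.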






Re: [PR] Use multi-select instead of a full sort for DynamicRange creation [lucene]

2024-10-26 Thread via GitHub


stefanvodita commented on code in PR #13914:
URL: https://github.com/apache/lucene/pull/13914#discussion_r1817835241


##
lucene/core/src/java/org/apache/lucene/util/WeightedSelector.java:
##
@@ -0,0 +1,407 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util;
+
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.SplittableRandom;
+
+/**
+ * Adaptive selection algorithm based on the introspective quick select algorithm. The quick select
+ * algorithm uses an interpolation variant of Tukey's ninther median-of-medians for pivot, and
+ * Bentley-McIlroy 3-way partitioning. For the introspective protection, it shuffles the sub-range
+ * if the max recursive depth is exceeded.
+ *
+ * This selection algorithm is fast on most data shapes, especially on nearly sorted data, or
+ * when k is close to the boundaries. It runs in linear time on average.
+ *
+ * @lucene.internal
+ */
+public abstract class WeightedSelector {
+
+  // This selector is used repeatedly by the radix selector for sub-ranges of less than
+  // 100 entries. This means this selector is also optimized to be fast on small ranges.
+  // It uses the variant of medians-of-medians and 3-way partitioning, and finishes the
+  // last tiny range (3 entries or less) with a very specialized sort.
+
+  private SplittableRandom random;
+
+  protected abstract long getWeight(int i);
+
+  protected abstract long getValue(int i);
+
+  public final WeightRangeInfo[] select(
+  int from,
+  int to,
+  long rangeTotalValue,
+  long beforeTotalValue,
+  long rangeWeight,
+  long beforeWeight,
+  double[] kWeights) {

Review Comment:
   Does it make sense to replace `k` with `quantile` maybe?






Re: [PR] Remove LeafSimScorer abstraction. [lucene]

2024-10-26 Thread via GitHub


jpountz merged PR #13957:
URL: https://github.com/apache/lucene/pull/13957





Re: [I] Absolutely horrible Lucene performance with JDK 23 (Lucene 9.11.1 and 10.0.0) [lucene]

2024-10-26 Thread via GitHub


jpountz commented on issue #13959:
URL: https://github.com/apache/lucene/issues/13959#issuecomment-2439549339

   Is there also no regression if you use the default garbage collector? If so, 
this looks like a regression with Shenandoah.





Re: [I] Absolutely horrible Lucene performance with JDK 23 (Lucene 9.11.1 and 10.0.0) [lucene]

2024-10-26 Thread via GitHub


derreisende77 commented on issue #13959:
URL: https://github.com/apache/lucene/issues/13959#issuecomment-2439550280

   I have no problem with the default GC either.
   I downloaded a Windows Amazon Corretto 23 nightly build from today and I 
don't have a problem with Shenandoah anymore. So I guess it is a Shenandoah 
problem in JDK 23 and 23.0.1 that will be fixed in a later 23 release.
   





Re: [I] Absolutely horrible Lucene performance with JDK 23 (Lucene 9.11.1 and 10.0.0) [lucene]

2024-10-26 Thread via GitHub


derreisende77 commented on issue #13959:
URL: https://github.com/apache/lucene/issues/13959#issuecomment-2439422246

   @benwtrent I made several tests on macOS with JDK 23 and 23.0.1 from 
Liberica and Azul. I always ran into the performance problem.
   I switched from Shenandoah GC to ZGC with
   ```
   -XX:+UseZGC
   -XX:+ZGenerational
   -XX:ZUncommitDelay=5
   -XX:+ZUncommit
   -XX:SoftMaxHeapSize=4g
   ```
   and so far I have been unable to trigger the performance problem. Lucene on 
JDK 23 is as performant as before.
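When comparing runs across collectors like this, it can help to confirm which garbage collector the JVM actually selected. The following is a minimal sketch using the standard `java.lang.management` API. The class name `ActiveGc` is hypothetical, and the bean names it prints vary by collector and JDK build, so treat the output as informational only.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper (hypothetical name): lists the GC beans registered in the running JVM. */
final class ActiveGc {

  /** Returns the names reported by the JVM's garbage collector MXBeans. */
  static List<String> collectorNames() {
    List<String> names = new ArrayList<>();
    for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
      names.add(bean.getName());
    }
    return names;
  }

  public static void main(String[] args) {
    // The exact strings depend on the collector in use and on the JDK build.
    System.out.println("Active collectors: " + collectorNames());
  }
}
```

Running it with the same flags as the application, for example `java -XX:+UseZGC ActiveGc.java`, should show whether the requested collector took effect.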





Re: [I] Absolutely horrible Lucene performance with JDK 23 (Lucene 9.11.1 and 10.0.0) [lucene]

2024-10-26 Thread via GitHub


derreisende77 closed issue #13959: Absolutely horrible Lucene performance with 
JDK 23 (Lucene 9.11.1 and 10.0.0)
URL: https://github.com/apache/lucene/issues/13959

