Re: [PR] Use a MergeSorter taking advantage of extra storage for StableMSBRadixSorter [lucene]

2023-10-05 Thread via GitHub


jpountz commented on code in PR #12623:
URL: https://github.com/apache/lucene/pull/12623#discussion_r1346923517


##
lucene/core/src/java/org/apache/lucene/util/StableMSBRadixSorter.java:
##
@@ -78,4 +78,60 @@ protected void reorder(int from, int to, int[] startOffsets, int[] endOffsets, i
     }
     restore(from, to);
   }
+
+  /** A MergeSorter taking advantage of temporary storage. */
+  protected abstract class MergeSorter extends Sorter {
+    @Override
+    public void sort(int from, int to) {
+      checkRange(from, to);
+      mergeSort(from, to);
+    }
+
+    private void mergeSort(int from, int to) {
+      if (to - from < BINARY_SORT_THRESHOLD) {
+        binarySort(from, to);
+      } else {
+        final int mid = (from + to) >>> 1;
+        mergeSort(from, mid);
+        mergeSort(mid, to);
+        merge(from, to, mid);
+      }
+    }
+
+    /**
+     * We tried to expose this to implementations to get a bulk copy optimization. But it did not
+     * bring a noticeable improvement in benchmark as {@code len} is usually small.
+     */
+    private void bulkSave(int from, int tmpFrom, int len) {
+      for (int i = 0; i < len; i++) {
+        save(from + i, tmpFrom + i);
+      }
+    }
+
+    private void merge(int from, int to, int mid) {
+      assert to > mid && mid > from;

Review Comment:
   In merge sort, it is common to check whether the value at `mid - 1` is less 
than or equal to the value at `mid`, which saves work when the data is already 
(partially) sorted. Maybe we could do that here too?
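
For illustration, the check suggested above could look roughly like the sketch below. It is a minimal example on a plain `int[]`, not Lucene's `Sorter` API: the class `PresortedMergeSortSketch`, its methods, and the `mergesSkipped` counter are all hypothetical names introduced here, and the code that was actually merged in the PR may differ.

```java
import java.util.Arrays;

/** Illustrative only: a plain int[] merge sort, not Lucene's Sorter API. */
final class PresortedMergeSortSketch {

  /** Counts how many merges were skipped; exists only to make the effect visible. */
  static int mergesSkipped = 0;

  static void sort(int[] a) {
    // The left half of any range never exceeds a.length / 2 elements.
    mergeSort(a, 0, a.length, new int[a.length / 2 + 1]);
  }

  private static void mergeSort(int[] a, int from, int to, int[] tmp) {
    if (to - from < 2) {
      return; // zero or one element: already sorted
    }
    final int mid = (from + to) >>> 1;
    mergeSort(a, from, mid, tmp);
    mergeSort(a, mid, to, tmp);
    // Both halves are sorted. If the last element of the left half is not
    // greater than the first element of the right half, the whole range is
    // already in order and the merge can be skipped.
    if (a[mid - 1] <= a[mid]) {
      mergesSkipped++;
      return;
    }
    merge(a, from, mid, to, tmp);
  }

  private static void merge(int[] a, int from, int mid, int to, int[] tmp) {
    final int leftLen = mid - from;
    System.arraycopy(a, from, tmp, 0, leftLen); // save the left half
    int i = 0, j = mid, dest = from;
    while (i < leftLen && j < to) {
      // "<=" takes from the left half on ties, which keeps the sort stable.
      if (tmp[i] <= a[j]) {
        a[dest++] = tmp[i++];
      } else {
        a[dest++] = a[j++];
      }
    }
    while (i < leftLen) {
      a[dest++] = tmp[i++];
    }
    // Any remaining right-half elements are already in their final position.
  }

  public static void main(String[] args) {
    int[] data = {5, 3, 8, 1, 9, 2, 7};
    sort(data);
    System.out.println(Arrays.toString(data));
  }
}
```

On fully sorted input every merge is skipped, so the cost of the merge phase drops to one comparison per recursion node.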



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] SOLR-16843: Replace timeNs by epochTimeNs in most of autoscaling [lucene-solr]

2023-10-05 Thread via GitHub


psalagnac opened a new pull request, #2679:
URL: https://github.com/apache/lucene-solr/pull/2679

   [SOLR-16843](https://issues.apache.org/jira/browse/SOLR-16843)
   
   
   
   
   # Description
   
   The autoscaling framework uses timestamps returned by the JVM call 
`System.nanoTime()`, but according to the Javadoc, this is NOT an absolute 
timestamp. It is just a number relative to an arbitrary origin, and this origin 
may change each time the JVM is restarted.
   
   Such a timestamp cannot be reused across JVM instances (either on another 
Solr node or on the same node after a JVM restart).
   
   # Solution
   
   For all timestamps that are either persisted at some point or used for event 
timestamps, use `getEpochTimeNs()` instead of `getTimeNs()`. Values returned by 
`getEpochTimeNs()` are absolute and can be safely compared.
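
The difference between the two kinds of timestamp can be sketched as follows. The class `EpochTimeSketch` and both method bodies are assumptions made for illustration; they mirror the method names mentioned above but are not Solr's actual implementation, which may compute the epoch value differently.

```java
import java.util.concurrent.TimeUnit;

/** Illustrative only: hypothetical stand-ins for getTimeNs() / getEpochTimeNs(). */
final class EpochTimeSketch {

  /**
   * Monotonic time relative to an arbitrary, JVM-specific origin. Only the
   * difference between two values taken in the same JVM is meaningful.
   */
  static long timeNs() {
    return System.nanoTime();
  }

  /**
   * Wall-clock time since the Unix epoch, expressed in nanoseconds (with
   * millisecond precision in this sketch). Values from different JVMs can be
   * compared, subject to clock synchronization between machines.
   */
  static long epochTimeNs() {
    return TimeUnit.MILLISECONDS.toNanos(System.currentTimeMillis());
  }

  public static void main(String[] args) {
    long start = timeNs();
    long eventTimestamp = epochTimeNs(); // safe to persist or send to another node
    long elapsedNs = timeNs() - start;   // fine: both values come from this JVM
    System.out.println("event at " + eventTimestamp + " ns since epoch, measured in " + elapsedNs + " ns");
  }
}
```

Under these assumptions, durations would still be measured with the monotonic clock, while anything persisted or exchanged between nodes would use the epoch-based value.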
   
   





Re: [PR] TaskExecutor waits for all tasks to complete before returning [lucene]

2023-10-05 Thread via GitHub


javanna merged PR #12523:
URL: https://github.com/apache/lucene/pull/12523





Re: [PR] TaskExecutor waits for all tasks to complete before returning [lucene]

2023-10-05 Thread via GitHub


javanna commented on PR #12523:
URL: https://github.com/apache/lucene/pull/12523#issuecomment-1748362692

   Thanks @quux00 !





Re: [PR] Use a MergeSorter taking advantage of extra storage for StableMSBRadixSorter [lucene]

2023-10-05 Thread via GitHub


gf2121 commented on code in PR #12623:
URL: https://github.com/apache/lucene/pull/12623#discussion_r1347069694


##
lucene/core/src/java/org/apache/lucene/util/StableMSBRadixSorter.java:
##
@@ -78,4 +78,60 @@ protected void reorder(int from, int to, int[] startOffsets, int[] endOffsets, i
     }
     restore(from, to);
   }
+
+  /** A MergeSorter taking advantage of temporary storage. */
+  protected abstract class MergeSorter extends Sorter {
+    @Override
+    public void sort(int from, int to) {
+      checkRange(from, to);
+      mergeSort(from, to);
+    }
+
+    private void mergeSort(int from, int to) {
+      if (to - from < BINARY_SORT_THRESHOLD) {
+        binarySort(from, to);
+      } else {
+        final int mid = (from + to) >>> 1;
+        mergeSort(from, mid);
+        mergeSort(mid, to);
+        merge(from, to, mid);
+      }
+    }
+
+    /**
+     * We tried to expose this to implementations to get a bulk copy optimization. But it did not
+     * bring a noticeable improvement in benchmark as {@code len} is usually small.
+     */
+    private void bulkSave(int from, int tmpFrom, int len) {
+      for (int i = 0; i < len; i++) {
+        save(from + i, tmpFrom + i);
+      }
+    }
+
+    private void merge(int from, int to, int mid) {
+      assert to > mid && mid > from;

Review Comment:
   Great advice! Thank you!






Re: [PR] Use a MergeSorter taking advantage of extra storage for StableMSBRadixSorter [lucene]

2023-10-05 Thread via GitHub


gf2121 merged PR #12623:
URL: https://github.com/apache/lucene/pull/12623





Re: [PR] Compute better windows in MaxScoreBulkScorer. [lucene]

2023-10-05 Thread via GitHub


jpountz merged PR #12593:
URL: https://github.com/apache/lucene/pull/12593





Re: [PR] Ability to compute vector similarity scores with DoubleValuesSource [lucene]

2023-10-05 Thread via GitHub


stefanvodita commented on code in PR #12548:
URL: https://github.com/apache/lucene/pull/12548#discussion_r1347266594


##
lucene/core/src/java/org/apache/lucene/search/FloatVectorSimilarityValuesSource.java:
##
@@ -0,0 +1,101 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Objects;
+import org.apache.lucene.index.FloatVectorValues;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.VectorSimilarityFunction;
+
+/**
+ * A {@link DoubleValuesSource} which computes the vector similarity scores between the query vector
+ * and the {@link org.apache.lucene.document.KnnFloatVectorField} for documents.
+ */
+class FloatVectorSimilarityValuesSource extends DoubleValuesSource {

Review Comment:
   Great! I think this is the right approach.






Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-05 Thread via GitHub


robertvanwinkle1138 commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1749187258

   The SPANN paper does not address efficient filtered queries.  Lucene's HNSW 
calculates the similarity score for every record, regardless of whether the 
record matches the filter.  
   
   Filtered-DiskANN [1] describes a solution for efficient filtered queries.  
   
   QDrant has a filter solution however the methodology described in their blog 
is opaque.  
   
   1. https://dl.acm.org/doi/pdf/10.1145/3543507.3583552
   
   > As Approximate Nearest Neighbor Search (ANNS)-based dense retrieval 
becomes ubiquitous for search and recommendation scenarios, efficiently answering 
filtered ANNS queries has become a critical requirement. Filtered ANNS queries 
ask for the nearest neighbors of a query’s embedding from the points in the 
index that match the query’s labels such as date, price range, language. There 
has been little prior work on algorithms that use label metadata associated 
with vector data to build efficient indices for filtered ANNS queries. 
Consequently, current indices have high search latency or low recall which is 
not practical in interactive web-scenarios. We present two algorithms with 
native support for faster and more accurate filtered ANNS queries: one with 
streaming support, and another based on batch construction. Central to our 
algorithms is the construction of a graph-structured index which forms 
connections not only based on the geometry of the vector data, but also the 
associated label set. On real-world data with natural labels, both algorithms 
are an order of magnitude or more efficient for filtered queries than the 
current state of the art algorithms. The generated indices also be queried from 
an SSD and support thousands of queries per second at over 90% recall@10.
   





Re: [PR] Allow implementers of AbstractKnnVectorQuery to access final topK results [lucene]

2023-10-05 Thread via GitHub


kaivalnp commented on PR #12590:
URL: https://github.com/apache/lucene/pull/12590#issuecomment-1749204748

   Hi @benwtrent @mikemccand can someone help merge this in / let me know if 
there's anything pending?





Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-10-05 Thread via GitHub


benwtrent commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1749322588

   > QDrant has a filter solution however the methodology described in their 
blog is opaque.
   
   QDrant's HNSW filter solution is exactly the same as Lucene's. You can look at 
the code: they don't filter candidate exploration, they filter result collection.
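
To make the distinction concrete, here is a toy sketch of a graph search that explores every neighbor but only collects nodes accepted by the filter. It is not Lucene's or QDrant's code: the class `FilteredGraphSearchSketch`, the 1-D points, and the single-layer graph without early termination are all simplifying assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.IntPredicate;

/** Illustrative only: a toy best-first graph search, not an HNSW implementation. */
final class FilteredGraphSearchSketch {

  /**
   * Returns up to k node ids accepted by the filter, nearest to the query first.
   * Points are 1-D for brevity; distance is the absolute difference.
   */
  static List<Integer> search(
      float[] points, int[][] neighbors, int entry, float query, int k, IntPredicate filter) {
    Comparator<Integer> byDistance =
        Comparator.comparingDouble(n -> Math.abs(points[n] - query));
    PriorityQueue<Integer> candidates = new PriorityQueue<>(byDistance);
    // Max-heap on distance, so the farthest collected result is evicted first.
    PriorityQueue<Integer> results = new PriorityQueue<>(byDistance.reversed());
    boolean[] visited = new boolean[points.length];

    candidates.add(entry);
    visited[entry] = true;
    while (!candidates.isEmpty()) {
      int node = candidates.poll();
      // Result collection: the filter decides whether a node may be returned.
      if (filter.test(node)) {
        results.add(node);
        if (results.size() > k) {
          results.poll();
        }
      }
      // Candidate exploration: neighbors are followed regardless of the filter,
      // so rejected nodes still act as stepping stones through the graph.
      for (int next : neighbors[node]) {
        if (!visited[next]) {
          visited[next] = true;
          candidates.add(next);
        }
      }
    }
    List<Integer> ordered = new ArrayList<>(results);
    ordered.sort(byDistance);
    return ordered;
  }
}
```

In this sketch a rejected node still costs a distance computation, which is the overhead that filter-aware graph construction aims to reduce.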
   
   You are correct that filtering with SPANN would be different. Though I am 
not sure it's intractable. 
   
   It is possible that the candidate postings (gathered via HNSW) don't contain 
ANY filtered docs. This would require gathering more candidate postings.
   
   But I think we can do that before scoring. So, as candidate posting lists 
are gathered, ensure they have some candidates. 
   
   But I am pretty sure the SPANN repository supports filtering, and it's OSS, 
so we could always just read what they did.





Re: [PR] Allow implementers of AbstractKnnVectorQuery to access final topK results [lucene]

2023-10-05 Thread via GitHub


benwtrent commented on PR #12590:
URL: https://github.com/apache/lucene/pull/12590#issuecomment-1749324868

   @kaivalnp && @mikemccand I can merge and backport to 9x





Re: [PR] Allow implementers of AbstractKnnVectorQuery to access final topK results [lucene]

2023-10-05 Thread via GitHub


kaivalnp commented on PR #12590:
URL: https://github.com/apache/lucene/pull/12590#issuecomment-1749418274

   Thanks for all the help!





Re: [I] Make IndexWriter#flushNextBuffer flush deletes too? [lucene]

2023-10-05 Thread via GitHub


s1monw closed issue #12572: Make IndexWriter#flushNextBuffer flush deletes too?
URL: https://github.com/apache/lucene/issues/12572





Re: [I] Make IndexWriter#flushNextBuffer flush deletes too? [lucene]

2023-10-05 Thread via GitHub


s1monw commented on issue #12572:
URL: https://github.com/apache/lucene/issues/12572#issuecomment-1749458532

   After digging into this and opening a PR for it, I think this is unnecessary. 
I tried to beef up the tests for this, which made me refresh my knowledge of 
how things work down in the IW / DWPT. Every time we flush a DWPT, we freeze 
the global deletes buffer and push it to the queue, which essentially means we 
are applying deletes no matter how much memory it consumes. Digging deeper, I 
think we can / should do some cleanups in the IW regarding deletes. I started 
with good / better testing and will come up with some ideas in different 
PRs/issues.





Re: [PR] Make IndexWriter#flushNextBuffer also apply deletes if necessary [lucene]

2023-10-05 Thread via GitHub


s1monw commented on PR #12595:
URL: https://github.com/apache/lucene/pull/12595#issuecomment-1749460562

   see https://github.com/apache/lucene/issues/12572#issuecomment-1749458532





Re: [PR] Make IndexWriter#flushNextBuffer also apply deletes if necessary [lucene]

2023-10-05 Thread via GitHub


s1monw closed pull request #12595: Make IndexWriter#flushNextBuffer also apply 
deletes if necessary
URL: https://github.com/apache/lucene/pull/12595





Re: [I] Allow implementers of AbstractKnnVectorQuery to access final topK results? [lucene]

2023-10-05 Thread via GitHub


benwtrent closed issue #12575: Allow implementers of AbstractKnnVectorQuery to 
access final topK results?
URL: https://github.com/apache/lucene/issues/12575





Re: [PR] Allow implementers of AbstractKnnVectorQuery to access final topK results [lucene]

2023-10-05 Thread via GitHub


benwtrent merged PR #12590:
URL: https://github.com/apache/lucene/pull/12590





Re: [I] `FSTCompiler.Builder` should have an option to stream the FST bytes directly to Directory [lucene]

2023-10-05 Thread via GitHub


dungba88 commented on issue #12543:
URL: https://github.com/apache/lucene/issues/12543#issuecomment-1749982744

   One thing I think is missing: those byte manipulation methods should not be 
called after calling `#finish()`, but currently there is no such enforcement.
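
One possible shape for such enforcement is sketched below. The class `FinishGuardSketch` and its methods are hypothetical and do not correspond to the FST compiler's actual byte store API; the sketch only shows the general pattern of rejecting writes once `finish()` has been called.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative only: a hypothetical byte writer that rejects use after finish(). */
final class FinishGuardSketch {

  private final ByteArrayOutputStream out = new ByteArrayOutputStream();
  private boolean finished;

  /** Appends one byte; fails if finish() was already called. */
  void writeByte(byte b) {
    ensureNotFinished();
    out.write(b);
  }

  /** Seals the writer and returns the bytes written so far; may be called only once. */
  byte[] finish() {
    ensureNotFinished();
    finished = true;
    return out.toByteArray();
  }

  private void ensureNotFinished() {
    if (finished) {
      throw new IllegalStateException("already finished");
    }
  }
}
```

A cheaper variant would use an `assert` instead of throwing, at the cost of enforcing the rule only when assertions are enabled.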

