Re: [PR] Use a MergeSorter taking advantage of extra storage for StableMSBRadixSorter [lucene]
jpountz commented on code in PR #12623: URL: https://github.com/apache/lucene/pull/12623#discussion_r1346923517
## lucene/core/src/java/org/apache/lucene/util/StableMSBRadixSorter.java: ##
@@ -78,4 +78,60 @@ protected void reorder(int from, int to, int[] startOffsets, int[] endOffsets, i
     }
     restore(from, to);
   }
+
+  /** A MergeSorter taking advantage of temporary storage. */
+  protected abstract class MergeSorter extends Sorter {
+    @Override
+    public void sort(int from, int to) {
+      checkRange(from, to);
+      mergeSort(from, to);
+    }
+
+    private void mergeSort(int from, int to) {
+      if (to - from < BINARY_SORT_THRESHOLD) {
+        binarySort(from, to);
+      } else {
+        final int mid = (from + to) >>> 1;
+        mergeSort(from, mid);
+        mergeSort(mid, to);
+        merge(from, to, mid);
+      }
+    }
+
+    /**
+     * We tried to expose this to implementations to get a bulk copy optimization. But it did not
+     * bring a noticeable improvement in benchmark as {@code len} is usually small.
+     */
+    private void bulkSave(int from, int tmpFrom, int len) {
+      for (int i = 0; i < len; i++) {
+        save(from + i, tmpFrom + i);
+      }
+    }
+
+    private void merge(int from, int to, int mid) {
+      assert to > mid && mid > from;
Review Comment: In merge sort, it is common to check whether the value at mid-1 is less than or equal to the value at mid, to save work in case the data is already (partially) sorted. Maybe we could do that here too?
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
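The early-exit check suggested in this review comment can be illustrated with a minimal sketch. It assumes a plain `int[]` and a caller-provided temporary array instead of Lucene's `Sorter` callbacks (`save`, `restore`, `binarySort`), so the class name `SkipMergeSketch` and all of its members are hypothetical and are not the code that was merged in the PR.

```java
/** Hypothetical standalone sketch; it does not use the Lucene Sorter API. */
final class SkipMergeSketch {
  /** Counts skipped merges, for demonstration only. */
  static int skippedMerges = 0;

  static void mergeSort(int[] a, int[] tmp, int from, int to) {
    if (to - from < 2) {
      return; // zero or one element is already sorted
    }
    final int mid = (from + to) >>> 1;
    mergeSort(a, tmp, from, mid);
    mergeSort(a, tmp, mid, to);
    // The check from the review: if the last element of the left run is not
    // greater than the first element of the right run, the range is sorted.
    if (a[mid - 1] <= a[mid]) {
      skippedMerges++;
      return;
    }
    merge(a, tmp, from, mid, to);
  }

  private static void merge(int[] a, int[] tmp, int from, int mid, int to) {
    // Save the left run into temporary storage, then merge back into a.
    final int leftEnd = mid - from;
    System.arraycopy(a, from, tmp, 0, leftEnd);
    int left = 0;
    int right = mid;
    int dest = from;
    while (left < leftEnd && right < to) {
      // "<=" takes from the left run on ties, which keeps the sort stable.
      if (tmp[left] <= a[right]) {
        a[dest++] = tmp[left++];
      } else {
        a[dest++] = a[right++];
      }
    }
    while (left < leftEnd) {
      a[dest++] = tmp[left++];
    }
    // Any remaining right-run elements are already in their final position.
  }

  /** Returns a sorted copy of the input. */
  static int[] sorted(int[] in) {
    int[] a = in.clone();
    mergeSort(a, new int[a.length], 0, a.length);
    return a;
  }
}
```

On input that is already sorted, every merge in this sketch is skipped, so the only remaining cost is the recursion plus one comparison per range.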
[PR] SOLR-16843: Replace timeNs by epochTimeNs in most of autoscaling [lucene-solr]
psalagnac opened a new pull request, #2679: URL: https://github.com/apache/lucene-solr/pull/2679
[SOLR-16843](https://issues.apache.org/jira/browse/SOLR-16843)
# Description
The autoscaling framework uses timestamps returned by the JVM call `System.nanoTime()`, but according to the Javadoc, this is NOT an absolute timestamp. It is just a number relative to an arbitrary origin, and this origin may change each time the JVM is restarted. Such a timestamp cannot be reused across JVM instances (either on another Solr node or on the same node after a JVM restart).
# Solution
For all timestamps that are either persisted at some point or used for event timestamps, use `getEpochTimeNs()` instead of `getTimeNs()`. Values returned by `getEpochTimeNs()` are absolute and can be safely compared.
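The difference between the two kinds of timestamp can be sketched as follows. This is an illustration only: `EpochClockSketch` and its methods `timeNs()` and `epochTimeNs()` are hypothetical stand-ins and are not Solr's `getTimeNs()` / `getEpochTimeNs()` implementation. The sketch assumes that an epoch-based nanosecond value can be derived by anchoring `System.nanoTime()` to `System.currentTimeMillis()` once at construction.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical clock; not Solr's time source implementation. */
final class EpochClockSketch {
  // Wall-clock time at construction, converted to nanoseconds since the Unix epoch.
  private final long epochStartNs = TimeUnit.MILLISECONDS.toNanos(System.currentTimeMillis());
  // The JVM-relative reading taken at (almost) the same instant.
  private final long nanoStart = System.nanoTime();

  /** JVM-relative value: only differences within the same JVM are meaningful. */
  long timeNs() {
    return System.nanoTime();
  }

  /** Epoch-based value: can be persisted and compared across JVMs and restarts. */
  long epochTimeNs() {
    return epochStartNs + (System.nanoTime() - nanoStart);
  }
}
```

A persisted `timeNs()` value read back after a restart is compared against a different origin, which is the failure mode the description refers to. An `epochTimeNs()` value stays comparable, up to wall-clock accuracy.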
Re: [PR] TaskExecutor waits for all tasks to complete before returning [lucene]
javanna merged PR #12523: URL: https://github.com/apache/lucene/pull/12523
Re: [PR] TaskExecutor waits for all tasks to complete before returning [lucene]
javanna commented on PR #12523: URL: https://github.com/apache/lucene/pull/12523#issuecomment-1748362692 Thanks @quux00 !
Re: [PR] Use a MergeSorter taking advantage of extra storage for StableMSBRadixSorter [lucene]
gf2121 commented on code in PR #12623: URL: https://github.com/apache/lucene/pull/12623#discussion_r1347069694
## lucene/core/src/java/org/apache/lucene/util/StableMSBRadixSorter.java: ##
@@ -78,4 +78,60 @@ protected void reorder(int from, int to, int[] startOffsets, int[] endOffsets, i
     }
     restore(from, to);
   }
+
+  /** A MergeSorter taking advantage of temporary storage. */
+  protected abstract class MergeSorter extends Sorter {
+    @Override
+    public void sort(int from, int to) {
+      checkRange(from, to);
+      mergeSort(from, to);
+    }
+
+    private void mergeSort(int from, int to) {
+      if (to - from < BINARY_SORT_THRESHOLD) {
+        binarySort(from, to);
+      } else {
+        final int mid = (from + to) >>> 1;
+        mergeSort(from, mid);
+        mergeSort(mid, to);
+        merge(from, to, mid);
+      }
+    }
+
+    /**
+     * We tried to expose this to implementations to get a bulk copy optimization. But it did not
+     * bring a noticeable improvement in benchmark as {@code len} is usually small.
+     */
+    private void bulkSave(int from, int tmpFrom, int len) {
+      for (int i = 0; i < len; i++) {
+        save(from + i, tmpFrom + i);
+      }
+    }
+
+    private void merge(int from, int to, int mid) {
+      assert to > mid && mid > from;
Review Comment: Great advice! Thank you!
Re: [PR] Use a MergeSorter taking advantage of extra storage for StableMSBRadixSorter [lucene]
gf2121 merged PR #12623: URL: https://github.com/apache/lucene/pull/12623
Re: [PR] Compute better windows in MaxScoreBulkScorer. [lucene]
jpountz merged PR #12593: URL: https://github.com/apache/lucene/pull/12593
Re: [PR] Ability to compute vector similarity scores with DoubleValuesSource [lucene]
stefanvodita commented on code in PR #12548: URL: https://github.com/apache/lucene/pull/12548#discussion_r1347266594
## lucene/core/src/java/org/apache/lucene/search/FloatVectorSimilarityValuesSource.java: ##
@@ -0,0 +1,101 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Objects;
+import org.apache.lucene.index.FloatVectorValues;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.VectorSimilarityFunction;
+
+/**
+ * A {@link DoubleValuesSource} which computes the vector similarity scores between the query vector
+ * and the {@link org.apache.lucene.document.KnnFloatVectorField} for documents.
+ */
+class FloatVectorSimilarityValuesSource extends DoubleValuesSource {
Review Comment: Great! I think this is the right approach.
Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]
robertvanwinkle1138 commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1749187258 The SPANN paper does not address efficient filtered queries. Lucene's HNSW calculates the similarity score for every record, regardless of the record matching the filter. Filtered-DiskANN [1] describes a solution for efficient filtered queries. QDrant has a filter solution however the methodology described in their blog is opaque.
1. https://dl.acm.org/doi/pdf/10.1145/3543507.3583552
> As Approximate Nearest Neighbor Search (ANNS)-based dense retrieval becomes ubiquitous for search and recommendation scenarios, efficiently answering filtered ANNS queries has become a critical requirement. Filtered ANNS queries ask for the nearest neighbors of a query’s embedding from the points in the index that match the query’s labels such as date, price range, language. There has been little prior work on algorithms that use label metadata associated with vector data to build efficient indices for filtered ANNS queries. Consequently, current indices have high search latency or low recall which is not practical in interactive web-scenarios. We present two algorithms with native support for faster and more accurate filtered ANNS queries: one with streaming support, and another based on batch construction. Central to our algorithms is the construction of a graph-structured index which forms connections not only based on the geometry of the vector data, but also the associated label set. On real-world data with natural labels, both algorithms are an order of magnitude or more efficient for filtered queries than the current state of the art algorithms. The generated indices also be queried from an SSD and support thousands of queries per second at over 90% recall@10.
Re: [PR] Allow implementers of AbstractKnnVectorQuery to access final topK results [lucene]
kaivalnp commented on PR #12590: URL: https://github.com/apache/lucene/pull/12590#issuecomment-1749204748 Hi @benwtrent @mikemccand can someone help merge this in / let me know if there's anything pending?
Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]
benwtrent commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1749322588 > QDrant has a filter solution however the methodology described in their blog is opaque. QDrant's HNSW filter solution is exactly the same as Lucene's. You can look at the code: they don't filter candidate exploration but filter result collection. You are correct that filtering with SPANN would be different, though I am not sure it's intractable. It is possible that the candidate postings (gathered via HNSW) don't contain ANY filtered docs. This would require gathering more candidate postings, but I think we can do that before scoring. So, as candidate posting lists are gathered, ensure they have some candidates. But I am pretty sure the SPANN repository supports filtering, and it's OSS, so we could always just read what they did.
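The distinction drawn in this comment (the filter is applied to result collection, not to candidate exploration) can be sketched on a toy graph. This is a heavily simplified, single-layer best-first search over 1-D points using absolute difference as the distance. `FilteredGraphSearchSketch` and its parameters are hypothetical, and it is not Lucene's or QDrant's HNSW code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.IntPredicate;

/** Hypothetical toy search; a simplification, not a real HNSW implementation. */
final class FilteredGraphSearchSketch {
  static List<Integer> search(
      float[] points, int[][] neighbors, int entry, float query, int k, IntPredicate accept) {
    boolean[] visited = new boolean[points.length];
    Comparator<Integer> byDistance =
        Comparator.comparingDouble(n -> Math.abs(points[n] - query));
    // Candidates: closest first. Results: farthest first, so the worst is evicted.
    PriorityQueue<Integer> candidates = new PriorityQueue<>(byDistance);
    PriorityQueue<Integer> results = new PriorityQueue<>(byDistance.reversed());

    visited[entry] = true;
    candidates.add(entry);
    if (accept.test(entry)) {
      results.add(entry);
    }
    while (!candidates.isEmpty()) {
      int node = candidates.poll();
      // Stop once the best remaining candidate is worse than the worst kept result.
      if (results.size() == k && byDistance.compare(node, results.peek()) > 0) {
        break;
      }
      for (int nb : neighbors[node]) {
        if (visited[nb]) {
          continue;
        }
        visited[nb] = true;
        // Exploration ignores the filter: every neighbor is scored and queued.
        candidates.add(nb);
        // Collection applies the filter: only accepted nodes can be results.
        if (accept.test(nb)) {
          results.add(nb);
          if (results.size() > k) {
            results.poll();
          }
        }
      }
    }
    List<Integer> out = new ArrayList<>(results);
    out.sort(byDistance);
    return out;
  }
}
```

In this sketch, nodes rejected by the filter still act as stepping stones through the graph. That keeps the search connected, but their distances are computed even though they can never be returned.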
Re: [PR] Allow implementers of AbstractKnnVectorQuery to access final topK results [lucene]
benwtrent commented on PR #12590: URL: https://github.com/apache/lucene/pull/12590#issuecomment-1749324868 @kaivalnp && @mikemccand I can merge and backport to 9x
Re: [PR] Allow implementers of AbstractKnnVectorQuery to access final topK results [lucene]
kaivalnp commented on PR #12590: URL: https://github.com/apache/lucene/pull/12590#issuecomment-1749418274 Thanks for all the help!
Re: [I] Make IndexWriter#flushNextBuffer flush deletes too? [lucene]
s1monw closed issue #12572: Make IndexWriter#flushNextBuffer flush deletes too? URL: https://github.com/apache/lucene/issues/12572
Re: [I] Make IndexWriter#flushNextBuffer flush deletes too? [lucene]
s1monw commented on issue #12572: URL: https://github.com/apache/lucene/issues/12572#issuecomment-1749458532 After digging into this and opening a PR for it, I think this is unnecessary. I tried to beef up the tests for this, which caused me to refresh my knowledge of how stuff works down in the IW / DWPT. Every time we flush a DWPT we freeze the global deletes buffer and push it to the queue, which essentially means we are applying deletes no matter how much memory it consumes. Digging deeper, I think we can / should do some cleanups in the IW regarding deletes. I started with good / better testing and will come up with some ideas in different PRs/issues.
Re: [PR] Make IndexWriter#flushNextBuffer also apply deletes if necessary [lucene]
s1monw commented on PR #12595: URL: https://github.com/apache/lucene/pull/12595#issuecomment-1749460562 see https://github.com/apache/lucene/issues/12572#issuecomment-1749458532
Re: [PR] Make IndexWriter#flushNextBuffer also apply deletes if necessary [lucene]
s1monw closed pull request #12595: Make IndexWriter#flushNextBuffer also apply deletes if necessary URL: https://github.com/apache/lucene/pull/12595
Re: [I] Allow implementers of AbstractKnnVectorQuery to access final topK results? [lucene]
benwtrent closed issue #12575: Allow implementers of AbstractKnnVectorQuery to access final topK results? URL: https://github.com/apache/lucene/issues/12575
Re: [PR] Allow implementers of AbstractKnnVectorQuery to access final topK results [lucene]
benwtrent merged PR #12590: URL: https://github.com/apache/lucene/pull/12590
Re: [I] `FSTCompiler.Builder` should have an option to stream the FST bytes directly to Directory [lucene]
dungba88 commented on issue #12543: URL: https://github.com/apache/lucene/issues/12543#issuecomment-1749982744 One of the things I think is missing is that those byte manipulation methods should not be called after calling `#finish()`, but currently there is no such enforcement.
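One way such enforcement could look is sketched below, under the assumption that a simple boolean guard is acceptable. `FinishableBytesSketch` and its methods are hypothetical and do not correspond to the actual `FSTCompiler` or byte-store classes in Lucene.

```java
import java.util.Arrays;

/** Hypothetical byte buffer; not Lucene's FSTCompiler or byte-store API. */
final class FinishableBytesSketch {
  private byte[] bytes = new byte[16];
  private int length;
  private boolean finished;

  /** Appends one byte; rejected once finish() has been called. */
  void writeByte(byte b) {
    ensureNotFinished();
    if (length == bytes.length) {
      bytes = Arrays.copyOf(bytes, length * 2);
    }
    bytes[length++] = b;
  }

  /** Drops trailing bytes; this is also a mutation, so it is rejected after finish(). */
  void truncate(int newLength) {
    ensureNotFinished();
    if (newLength < 0 || newLength > length) {
      throw new IllegalArgumentException("newLength out of range: " + newLength);
    }
    length = newLength;
  }

  /** Marks the buffer as complete; reads remain allowed afterwards. */
  void finish() {
    finished = true;
  }

  /** Returns a copy of the bytes written so far. */
  byte[] toArray() {
    return Arrays.copyOf(bytes, length);
  }

  private void ensureNotFinished() {
    if (finished) {
      throw new IllegalStateException("already finished; no further writes allowed");
    }
  }
}
```

Routing every mutating method through a single guard keeps the rule in one place. Whether a runtime exception or an `assert` is the better fit is a design choice this sketch does not settle.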