[GitHub] [lucene] benwtrent commented on pull request #12434: Add ParentJoin KNN support
benwtrent commented on PR #12434: URL: https://github.com/apache/lucene/pull/12434#issuecomment-1632431376 @alessandrobenedetti I took some of your ideas on deduplicating vector IDs based on some other id for this PR. If this work continues, I think some of it can transfer to the native multi-vector support in Lucene. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on issue #12342: Prevent VectorSimilarity.DOT_PRODUCT from returning negative scores
benwtrent commented on issue #12342: URL: https://github.com/apache/lucene/issues/12342#issuecomment-1632450075 Thank you for the detailed information, @searchivarius. Eagerly awaiting your results, @jmazanec15 :)
[GitHub] [lucene] easyice opened a new pull request, #12435: Remove sort for uniqueValues in NumericDocValues
easyice opened a new pull request, #12435: URL: https://github.com/apache/lucene/pull/12435 ### Description Table compression only needs a value -> ord mapping. As long as we can get the value back via its ord when reading, the order of the values does not matter.
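To illustrate the point about ordering, here is a minimal sketch of table compression that assigns ords in first-seen order instead of sorted order. It is not the Lucene doc-values writer; the class `TableCompressionSketch` and its `Encoded`, `encode` and `decode` members are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: not the Lucene implementation.
public class TableCompressionSketch {

  /** The ord -> value table plus one ord per document. */
  public static final class Encoded {
    public final long[] table;
    public final int[] ords;

    Encoded(long[] table, int[] ords) {
      this.table = table;
      this.ords = ords;
    }
  }

  /** Assigns each distinct value the next free ord, in first-seen order (no sorting). */
  public static Encoded encode(long[] values) {
    Map<Long, Integer> valueToOrd = new HashMap<>();
    List<Long> table = new ArrayList<>();
    int[] ords = new int[values.length];
    for (int doc = 0; doc < values.length; doc++) {
      Integer ord = valueToOrd.get(values[doc]);
      if (ord == null) {
        ord = table.size();
        valueToOrd.put(values[doc], ord);
        table.add(values[doc]);
      }
      ords[doc] = ord;
    }
    long[] tableArray = new long[table.size()];
    for (int i = 0; i < tableArray.length; i++) {
      tableArray[i] = table.get(i);
    }
    return new Encoded(tableArray, ords);
  }

  /** Reading only needs ord -> value, so the table order is irrelevant. */
  public static long decode(Encoded encoded, int doc) {
    return encoded.table[encoded.ords[doc]];
  }
}
```

Decoding round-trips every value regardless of how the table is ordered, which is the property the description relies on.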
[GitHub] [lucene] jpountz commented on pull request #12405: Skip docs with Docvalues in NumericLeafComparator
jpountz commented on PR #12405: URL: https://github.com/apache/lucene/pull/12405#issuecomment-1632507845 Thanks for adding the enum. In my view, we now need the following two changes:
- `isMissingValueCompetitive()` should return false if the missing value is equal to the bottom value and the pruning is SKIP_MORE
- The competitive iterator can better tune the min/max values. For instance, currently, if the bottom value is 5 and the sort is ascending, we'll use a range on [MIN_VALUE, 5] to filter competitive hits. But with SKIP_MORE, we could now make it [MIN_VALUE, 4].

FYI we have `NumericUtils#(add|subtract)` to add/subtract on binary representations of numbers.
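The two changes above can be sketched on plain `long` values. This is only an illustration under assumptions: the real comparator works on binary representations (hence the `NumericUtils` pointer), and the class name, method names and the `skipMore` flag (standing in for the SKIP_MORE pruning mode) are hypothetical.

```java
// Hypothetical illustration only: not the NumericLeafComparator code.
public class CompetitiveRangeSketch {

  /**
   * Returns the inclusive [min, max] range of values that can still enter the top hits,
   * or null when no value can. With skipMore, ties with the bottom value are excluded.
   */
  public static long[] competitiveRange(long bottom, boolean ascending, boolean skipMore) {
    if (ascending) {
      if (!skipMore) {
        return new long[] {Long.MIN_VALUE, bottom};
      }
      if (bottom == Long.MIN_VALUE) {
        return null; // nothing sorts strictly before the bottom; also avoids underflow
      }
      return new long[] {Long.MIN_VALUE, bottom - 1};
    }
    if (!skipMore) {
      return new long[] {bottom, Long.MAX_VALUE};
    }
    if (bottom == Long.MAX_VALUE) {
      return null; // nothing sorts strictly after the bottom; also avoids overflow
    }
    return new long[] {bottom + 1, Long.MAX_VALUE};
  }

  /** A missing value is competitive only if it falls inside the competitive range. */
  public static boolean isMissingValueCompetitive(
      long missing, long bottom, boolean ascending, boolean skipMore) {
    long[] range = competitiveRange(bottom, ascending, skipMore);
    return range != null && missing >= range[0] && missing <= range[1];
  }
}
```

With a bottom value of 5 and an ascending sort, the range is [MIN_VALUE, 5] without `skipMore` and [MIN_VALUE, 4] with it, so a missing value equal to the bottom stops being competitive.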
[GitHub] [lucene] benwtrent commented on pull request #12434: Add ParentJoin KNN support
benwtrent commented on PR #12434: URL: https://github.com/apache/lucene/pull/12434#issuecomment-1632517795
> would it be enough or is there more?

I will dig a bit more into making this cleaner. My biggest performance concerns are keeping track of the heap-index -> ID mapping, shuffling those entries around so often, and resolving the docId by vector ordinal on every push.
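To make the bookkeeping cost concrete, here is a minimal sketch of a bounded min-heap that keeps at most one entry per parent ID. It is not the PR's implementation; `ParentDedupHeapSketch` and all of its members are hypothetical. Note that every swap has to update the parent -> slot map, which is the kind of shuffling the comment worries about.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: not the KnnResults code from the PR.
public class ParentDedupHeapSketch {
  private final int[] parents; // heap slot -> parent id
  private final float[] scores; // heap slot -> best child score seen for that parent
  private final Map<Integer, Integer> slotOfParent = new HashMap<>(); // parent id -> heap slot
  private int size;

  public ParentDedupHeapSketch(int k) {
    if (k <= 0) {
      throw new IllegalArgumentException("k must be positive");
    }
    parents = new int[k];
    scores = new float[k];
  }

  /** Offers a child hit for the given parent; returns true if the top-k changed. */
  public boolean offer(int parent, float score) {
    Integer slot = slotOfParent.get(parent);
    if (slot != null) {
      if (score <= scores[slot]) {
        return false; // an equal or better child of this parent is already tracked
      }
      scores[slot] = score;
      siftDown(slot); // the score grew, so in a min-heap the entry can only move down
      return true;
    }
    if (size < parents.length) {
      int newSlot = size++;
      parents[newSlot] = parent;
      scores[newSlot] = score;
      slotOfParent.put(parent, newSlot);
      siftUp(newSlot);
      return true;
    }
    if (score <= scores[0]) {
      return false; // not better than the current worst entry
    }
    slotOfParent.remove(parents[0]);
    parents[0] = parent;
    scores[0] = score;
    slotOfParent.put(parent, 0);
    siftDown(0);
    return true;
  }

  public int size() {
    return size;
  }

  public float minScore() {
    return scores[0];
  }

  public boolean contains(int parent) {
    return slotOfParent.containsKey(parent);
  }

  private void siftUp(int i) {
    while (i > 0) {
      int up = (i - 1) / 2;
      if (scores[up] <= scores[i]) {
        return;
      }
      swap(i, up);
      i = up;
    }
  }

  private void siftDown(int i) {
    while (true) {
      int left = 2 * i + 1;
      int right = left + 1;
      int smallest = i;
      if (left < size && scores[left] < scores[smallest]) {
        smallest = left;
      }
      if (right < size && scores[right] < scores[smallest]) {
        smallest = right;
      }
      if (smallest == i) {
        return;
      }
      swap(i, smallest);
      i = smallest;
    }
  }

  private void swap(int a, int b) {
    int parent = parents[a];
    parents[a] = parents[b];
    parents[b] = parent;
    float score = scores[a];
    scores[a] = scores[b];
    scores[b] = score;
    // Keep the parent -> slot map in sync on every swap.
    slotOfParent.put(parents[a], a);
    slotOfParent.put(parents[b], b);
  }
}
```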
[GitHub] [lucene] benwtrent commented on pull request #12434: Add ParentJoin KNN support
benwtrent commented on PR #12434: URL: https://github.com/apache/lucene/pull/12434#issuecomment-1633057341 @jpountz I took another shot at the KnnResults interface. I restricted the abstract and `@Override` methods to narrow the API. Additionally, I disconnected it from the queue, but it still has a queue object internally that sub-classes can utilize.
[GitHub] [lucene] almogtavor commented on issue #12406: Register nested queries (ToParentBlockJoinQuery) to Lucene Monitor
almogtavor commented on issue #12406: URL: https://github.com/apache/lucene/issues/12406#issuecomment-1633133797 @romseygeek @dweiss @uschindler I'd love to get your feedback on the subject. @jpountz @benwtrent I saw that Elasticsearch does have the option of percolating nested queries. I wonder if it has optimizations similar to Lucene Monitor's, or if it is just a query that gets executed every X seconds. Solr doesn't have an equivalent. @epugh
[GitHub] [lucene] benwtrent commented on a diff in pull request #12421: Concurrent hnsw graph and builder, take two
benwtrent commented on code in PR #12421: URL: https://github.com/apache/lucene/pull/12421#discussion_r1261778701

## lucene/core/src/java/org/apache/lucene/util/hnsw/ConcurrentHnswGraphBuilder.java: ##
@@ -0,0 +1,465 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.hnsw;
+
+import static java.lang.Math.log;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.Objects;
+import java.util.Set;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.CompletionException;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.ConcurrentSkipListSet;
+import java.util.concurrent.ExecutionException;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.util.concurrent.Semaphore;
+import java.util.concurrent.ThreadLocalRandom;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicReference;
+import java.util.function.Supplier;
+import org.apache.lucene.index.VectorEncoding;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.util.GrowableBitSet;
+import org.apache.lucene.util.InfoStream;
+import org.apache.lucene.util.NamedThreadFactory;
+import org.apache.lucene.util.ThreadInterruptedException;
+import org.apache.lucene.util.hnsw.ConcurrentOnHeapHnswGraph.NodeAtLevel;
+
+/**
+ * Builder for Concurrent HNSW graph. See {@link HnswGraph} for a high level overview, and the
+ * comments to `addGraphNode` for details on the concurrent building approach.
+ *
+ * @param the type of vector
+ */
+public class ConcurrentHnswGraphBuilder {
+
+  /** Default number of maximum connections per node */
+  public static final int DEFAULT_MAX_CONN = 16;
+
+  /**
+   * Default number of the size of the queue maintained while searching during a graph construction.
+   */
+  public static final int DEFAULT_BEAM_WIDTH = 100;
+
+  /** A name for the HNSW component for the info-stream */
+  public static final String HNSW_COMPONENT = "HNSW";
+
+  private final int beamWidth;
+  private final double ml;
+  private final ExplicitThreadLocal scratchNeighbors;
+
+  private final VectorSimilarityFunction similarityFunction;
+  private final VectorEncoding vectorEncoding;
+  private final RandomAccessVectorValues vectors;
+  private final ExplicitThreadLocal> graphSearcher;
+  private final ExplicitThreadLocal beamCandidates;
+
+  final ConcurrentOnHeapHnswGraph hnsw;
+  private final ConcurrentSkipListSet insertionsInProgress =
+      new ConcurrentSkipListSet<>();
+
+  private InfoStream infoStream = InfoStream.getDefault();
+
+  // we need two sources of vectors in order to perform diversity check comparisons without
+  // colliding
+  private final RandomAccessVectorValues vectorsCopy;
+
+  /** This is the "native" factory for ConcurrentHnswGraphBuilder. */
+  public static ConcurrentHnswGraphBuilder create(
+      RandomAccessVectorValues vectors,
+      VectorEncoding vectorEncoding,
+      VectorSimilarityFunction similarityFunction,
+      int M,
+      int beamWidth)
+      throws IOException {
+    return new ConcurrentHnswGraphBuilder<>(
+        vectors, vectorEncoding, similarityFunction, M, beamWidth);
+  }
+
+  /**
+   * Reads all the vectors from vector values, builds a graph connecting them by their dense
+   * ordinals, using the given hyperparameter settings, and returns the resulting graph.
+   *
+   * @param vectors the vectors whose relations are represented by the graph - must provide a
+   *     different view over those vectors than the one used to add via addGraphNode.
+   * @param M – graph fanout parameter used to calculate the maximum number of connections a node
+   *     can have – M on upper layers, and M * 2 on the lowest level.
+   * @param beamWidth the size of the beam search to use when finding nearest neighbors.
+   */
+  public ConcurrentHnswGraphBuilder(
+      RandomAccessVectorValues vectors,
+      VectorEncoding vectorEncoding,
+      VectorSimilarityFunction similarityFunction,
+      int M,
+      int beamWidth)
+      throws IOException {
+    this.vectors = vector
[GitHub] [lucene] mayya-sharipova opened a new pull request, #12436: Move max vector dims limit to Codec
mayya-sharipova opened a new pull request, #12436: URL: https://github.com/apache/lucene/pull/12436 Move enforcement of the vector max dimension limit into the default Codec's KnnVectorsFormat implementation. This allows different implementations of KNN search algorithms to define their own limits on the maximum number of vector dimensions they can handle. Closes #12309
[GitHub] [lucene] shubhamvishu commented on pull request #12427: StringsToAutomaton#build to take List as parameter instead of Collection
shubhamvishu commented on PR #12427: URL: https://github.com/apache/lucene/pull/12427#issuecomment-1633489589
> This is a situation where we really cannot sort on behalf of the caller, so it might be a bit confusing/trappy to sort some flavors of this method but not others? Maybe it's best to leave these methods as they are?

Agreed @gsmiller! Yes, it does seem better to leave this as is.

> we could look at changing the assert on line 276 of StringsToAutomaton to throw an explicit IllegalArgumentException so that we don't silently build a corrupt automaton on unordered input (with asserts disabled). This would add overhead since we have to now keep track of the previous term all the time, but maybe it's worth benchmarking and considering this change?

I like the idea of throwing an IAE if wrong input is provided. I think this would only affect the cases with asserts disabled? With asserts enabled, we always keep track of the previous term anyway.

> maybe we can relax StringsToAutomaton#build(Collection, boolean) to StringsToAutomaton#build(Iterable input, boolean). This change will also make it more consistent with the build(BytesRefIterator input, boolean asBinary) method.

Good point @gautamworah96. `Iterable` seems like a better choice than `Collection` and would indeed be consistent with the other method.
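The idea of replacing the assert with an explicit exception can be sketched as follows. This is not the `StringsToAutomaton` code: `SortedInputCheckSketch` and `consumeSorted` are hypothetical names, terms are plain `byte[]` values rather than `BytesRef`, and the assumption that duplicates are allowed is mine.

```java
import java.util.Arrays;

// Hypothetical illustration only: not the StringsToAutomaton implementation.
public class SortedInputCheckSketch {

  /**
   * Consumes terms that must arrive in ascending unsigned byte order (duplicates are allowed
   * in this sketch). Throws IllegalArgumentException on the first out-of-order term, whether
   * or not asserts are enabled. Returns the number of terms consumed.
   */
  public static int consumeSorted(Iterable<byte[]> terms) {
    byte[] previous = null;
    int count = 0;
    for (byte[] current : terms) {
      if (previous != null && Arrays.compareUnsigned(previous, current) > 0) {
        throw new IllegalArgumentException(
            "Input must be sorted; term at position " + count + " is out of order");
      }
      // This copy is the overhead discussed above: the previous term is now always
      // tracked, since the caller may reuse its buffer between iterations.
      previous = current.clone();
      count++;
    }
    return count;
  }
}
```

Because the parameter is an `Iterable`, the same method accepts a `List`, a `Set` or any other iterable source, which matches the relaxation suggested in the last quote.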
[GitHub] [lucene] shubhamvishu commented on pull request #12427: StringsToAutomaton#build to take List as parameter instead of Collection
shubhamvishu commented on PR #12427: URL: https://github.com/apache/lucene/pull/12427#issuecomment-1633502769 On the same note, since both methods expect an `Iterable` or an iterator, why do we even need two separate methods here that do exactly the same thing, i.e. iterate over the `BytesRef`s and add them to the automaton? It would have been much better if `BytesRefIterator` implemented the `Iterable` and `Iterator` interfaces, in which case we could have just one method, `StringsToAutomaton#build(Iterable)`, which takes an `Iterable`. I don't see why we even have a separate `BytesRefIterator` interface instead of just using `Iterator` (maybe because it's legacy code?), or am I missing something important here?
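One difference that may be relevant to the question: `BytesRefIterator#next()` is declared to throw `IOException` and signals exhaustion by returning null, whereas `java.util.Iterator` has a separate `hasNext()` and no checked exceptions. The sketch below shows what bridging the two contracts involves. It does not use Lucene; `NullTerminatedIterator` is a hypothetical stand-in for a `BytesRefIterator`-style interface, and `asIterator` and `drain` are names made up for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration only: not part of Lucene.
public class IteratorAdapterSketch {

  /** Stand-in for a BytesRefIterator-style contract: next() returns null when exhausted. */
  public interface NullTerminatedIterator<T> {
    T next() throws IOException;
  }

  /** Wraps a null-terminated source as a single-use java.util.Iterator. */
  public static <T> Iterator<T> asIterator(NullTerminatedIterator<T> source) {
    return new Iterator<T>() {
      private T lookahead;
      private boolean done;

      @Override
      public boolean hasNext() {
        if (lookahead == null && !done) {
          try {
            lookahead = source.next(); // one element of lookahead is needed for hasNext()
          } catch (IOException e) {
            throw new UncheckedIOException(e); // Iterator cannot throw checked exceptions
          }
          done = lookahead == null;
        }
        return lookahead != null;
      }

      @Override
      public T next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        T result = lookahead;
        lookahead = null;
        return result;
      }
    };
  }

  /** Collects all remaining elements of the source into a list. */
  public static <T> List<T> drain(NullTerminatedIterator<T> source) {
    List<T> result = new ArrayList<>();
    Iterator<T> iterator = asIterator(source);
    while (iterator.hasNext()) {
      result.add(iterator.next());
    }
    return result;
  }
}
```

The adapter needs a lookahead slot and exception wrapping, and the result can only be traversed once, so it is an `Iterator` rather than a reusable `Iterable`.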