[GitHub] [lucene] benwtrent commented on pull request #12434: Add ParentJoin KNN support

2023-07-12 Thread via GitHub


benwtrent commented on PR #12434:
URL: https://github.com/apache/lucene/pull/12434#issuecomment-1632431376

   @alessandrobenedetti I took some of your ideas on deduplicating vector IDs 
based on some other id for this PR. If this work continues, I think some of it 
can transfer to the native multi-vector support in Lucene.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on issue #12342: Prevent VectorSimilarity.DOT_PRODUCT from returning negative scores

2023-07-12 Thread via GitHub


benwtrent commented on issue #12342:
URL: https://github.com/apache/lucene/issues/12342#issuecomment-1632450075

   Thank you for the detailed information, @searchivarius.
   
   Eagerly awaiting your results, @jmazanec15 :)





[GitHub] [lucene] easyice opened a new pull request, #12435: Remove sort for uniqueValues in NumericDocValues

2023-07-12 Thread via GitHub


easyice opened a new pull request, #12435:
URL: https://github.com/apache/lucene/pull/12435

   ### Description
   
   
   Table compression only needs a value -> ord mapping. As long as we can get 
the value back from its ord when reading, the order of the values does not matter.
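   A minimal sketch of why the order is irrelevant, assuming a simplified table encoding. `TableEncodingSketch` and its methods are hypothetical names for illustration and are not the Lucene doc-values writer code touched by this PR. Ords are assigned in first-seen order, and decoding only ever does `table[ord]`:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of table encoding; not Lucene's actual writer code. */
final class TableEncodingSketch {

  /** Assigns each distinct value an ord in first-seen order, without sorting. */
  static long[] buildTable(long[] values) {
    Map<Long, Integer> valueToOrd = new LinkedHashMap<>();
    for (long v : values) {
      valueToOrd.putIfAbsent(v, valueToOrd.size());
    }
    long[] table = new long[valueToOrd.size()];
    for (Map.Entry<Long, Integer> e : valueToOrd.entrySet()) {
      table[e.getValue()] = e.getKey();
    }
    return table;
  }

  /** Replaces every value with its ord in the given table. */
  static int[] encode(long[] values, long[] table) {
    Map<Long, Integer> valueToOrd = new HashMap<>();
    for (int ord = 0; ord < table.length; ord++) {
      valueToOrd.put(table[ord], ord);
    }
    int[] ords = new int[values.length];
    for (int i = 0; i < values.length; i++) {
      ords[i] = valueToOrd.get(values[i]);
    }
    return ords;
  }

  /** Reading side: a plain lookup by ord, independent of how the table is ordered. */
  static long[] decode(int[] ords, long[] table) {
    long[] values = new long[ords.length];
    for (int i = 0; i < ords.length; i++) {
      values[i] = table[ords[i]];
    }
    return values;
  }
}
```

   Encoding and decoding round-trip with either a sorted or an unsorted table, as long as the same table is used on both sides.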





[GitHub] [lucene] jpountz commented on pull request #12405: Skip docs with Docvalues in NumericLeafComparator

2023-07-12 Thread via GitHub


jpountz commented on PR #12405:
URL: https://github.com/apache/lucene/pull/12405#issuecomment-1632507845

   Thanks for adding the enum. In my view, we now need the two following 
changes:
- `isMissingValueCompetitive()` should return false if the missing value is 
equal to the bottom value and the pruning is SKIP_MORE
- The competitive iterator can better tune the min/max values. For instance 
currently, if the bottom value is 5 and the sort is ascending, we'll use a 
range on [MIN_VALUE, 5] to filter competitive hits. But with SKIP_MORE, we 
could now make it [MIN_VALUE, 4]. FYI we have `NumericUtils#(add|subtract)` to 
add/subtract on binary representations of numbers.
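   The two changes above can be sketched roughly as follows for an ascending sort on `long` values. This is only an illustration: `CompetitiveRangeSketch`, its methods, and the `skipMore` flag are hypothetical names standing in for the SKIP_MORE pruning mode, and plain `long` arithmetic is used here instead of `NumericUtils` on binary representations. The behavior when `skipMore` is false is an assumption.

```java
/** Hypothetical sketch of the range tightening described above; names are illustrative. */
final class CompetitiveRangeSketch {

  /**
   * Inclusive upper bound of the competitive range [MIN_VALUE, max] for an ascending sort.
   * Returns null when no value can be competitive any more.
   */
  static Long maxCompetitiveValue(long bottom, boolean skipMore) {
    if (!skipMore) {
      return bottom; // ties with the bottom value are still considered
    }
    if (bottom == Long.MIN_VALUE) {
      return null; // nothing is strictly smaller than MIN_VALUE; also avoids underflow
    }
    return bottom - 1; // ties are skipped, so the range shrinks by one
  }

  /** Ascending sort: with skipMore, a missing value equal to the bottom is not competitive. */
  static boolean isMissingValueCompetitive(long missingValue, long bottom, boolean skipMore) {
    return skipMore ? missingValue < bottom : missingValue <= bottom;
  }
}
```

   With a bottom value of 5 this yields an upper bound of 5 without `skipMore` and 4 with it, matching the example in the comment.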





[GitHub] [lucene] benwtrent commented on pull request #12434: Add ParentJoin KNN support

2023-07-12 Thread via GitHub


benwtrent commented on PR #12434:
URL: https://github.com/apache/lucene/pull/12434#issuecomment-1632517795

   > would it be enough or is there more?
   
   I will dig a bit more on making this cleaner. 
   
   My biggest performance concerns are around keeping track of the heap-index 
-> ID mapping, shuffling those entries around so often, and resolving the docId 
by vector ordinal on every push.
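   To make the cost concrete, here is a rough sketch of a bounded min-heap that keeps at most one entry per parent ID. It is not the code in this PR: `ParentDedupHeapSketch` and all of its members are hypothetical, and the docId-by-ordinal lookup is left out. The point is that every swap inside the heap also has to rewrite the ID -> heap-index map.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical top-k min-heap with one entry per parent ID; not the PR's implementation. */
final class ParentDedupHeapSketch {
  private final int[] parentIds;
  private final float[] scores;
  // parent ID -> current position in the heap arrays; rewritten on every swap
  private final Map<Integer, Integer> heapIndexOfParent = new HashMap<>();
  private int size;

  ParentDedupHeapSketch(int k) {
    if (k <= 0) throw new IllegalArgumentException("k must be positive");
    parentIds = new int[k];
    scores = new float[k];
  }

  /** Offers a child hit for the given parent; returns true if the heap changed. */
  boolean offer(int parentId, float score) {
    Integer idx = heapIndexOfParent.get(parentId);
    if (idx != null) { // parent already collected: keep only its best child score
      if (score <= scores[idx]) return false;
      scores[idx] = score;
      siftDown(idx); // the score grew, so in a min-heap it can only move down
      return true;
    }
    if (size < parentIds.length) {
      place(size, parentId, score);
      siftUp(size++);
      return true;
    }
    if (score <= scores[0]) return false; // not better than the current worst
    heapIndexOfParent.remove(parentIds[0]);
    place(0, parentId, score);
    siftDown(0);
    return true;
  }

  int size() { return size; }

  float minScore() { return scores[0]; }

  boolean containsParent(int parentId) { return heapIndexOfParent.containsKey(parentId); }

  private void place(int i, int parentId, float score) {
    parentIds[i] = parentId;
    scores[i] = score;
    heapIndexOfParent.put(parentId, i);
  }

  private void swap(int a, int b) {
    int p = parentIds[a];
    float s = scores[a];
    place(a, parentIds[b], scores[b]);
    place(b, p, s);
  }

  private void siftUp(int i) {
    while (i > 0) {
      int parent = (i - 1) / 2;
      if (scores[parent] <= scores[i]) break;
      swap(i, parent);
      i = parent;
    }
  }

  private void siftDown(int i) {
    while (true) {
      int left = 2 * i + 1, right = left + 1, smallest = i;
      if (left < size && scores[left] < scores[smallest]) smallest = left;
      if (right < size && scores[right] < scores[smallest]) smallest = right;
      if (smallest == i) return;
      swap(i, smallest);
      i = smallest;
    }
  }
}
```

   In this sketch each `swap` performs two map writes, which is the kind of bookkeeping overhead the comment is worried about.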





[GitHub] [lucene] benwtrent commented on pull request #12434: Add ParentJoin KNN support

2023-07-12 Thread via GitHub


benwtrent commented on PR #12434:
URL: https://github.com/apache/lucene/pull/12434#issuecomment-1633057341

   @jpountz I took another shot at the KnnResults interface. I restricted the 
abstract and `@Override` methods to narrow the API. Additionally, I 
disconnected it from the queue, but it still has a queue object internally that 
sub-classes can utilize.





[GitHub] [lucene] almogtavor commented on issue #12406: Register nested queries (ToParentBlockJoinQuery) to Lucene Monitor

2023-07-12 Thread via GitHub


almogtavor commented on issue #12406:
URL: https://github.com/apache/lucene/issues/12406#issuecomment-1633133797

   @romseygeek @dweiss @uschindler I'd love to get your feedback on this 
subject.
   
   @jpountz @benwtrent I saw that Elasticsearch does have the option of 
percolating nested queries. I wonder whether it has optimizations similar to 
those of Lucene Monitor, or whether it is just a query that gets executed every 
X seconds. Solr doesn't have an equivalent. @epugh 





[GitHub] [lucene] benwtrent commented on a diff in pull request #12421: Concurrent hnsw graph and builder, take two

2023-07-12 Thread via GitHub


benwtrent commented on code in PR #12421:
URL: https://github.com/apache/lucene/pull/12421#discussion_r1261778701


##
lucene/core/src/java/org/apache/lucene/util/hnsw/ConcurrentHnswGraphBuilder.java:
##
@@ -0,0 +1,465 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.hnsw;
+
+import static java.lang.Math.log;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.Objects;
+import java.util.Set;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.CompletionException;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.ConcurrentSkipListSet;
+import java.util.concurrent.ExecutionException;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.util.concurrent.Semaphore;
+import java.util.concurrent.ThreadLocalRandom;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicReference;
+import java.util.function.Supplier;
+import org.apache.lucene.index.VectorEncoding;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.util.GrowableBitSet;
+import org.apache.lucene.util.InfoStream;
+import org.apache.lucene.util.NamedThreadFactory;
+import org.apache.lucene.util.ThreadInterruptedException;
+import org.apache.lucene.util.hnsw.ConcurrentOnHeapHnswGraph.NodeAtLevel;
+
+/**
+ * Builder for Concurrent HNSW graph. See {@link HnswGraph} for a high level overview, and the
+ * comments to `addGraphNode` for details on the concurrent building approach.
+ *
+ * @param  the type of vector
+ */
+public class ConcurrentHnswGraphBuilder {
+
+  /** Default number of maximum connections per node */
+  public static final int DEFAULT_MAX_CONN = 16;
+
+  /**
+   * Default number of the size of the queue maintained while searching during a graph construction.
+   */
+  public static final int DEFAULT_BEAM_WIDTH = 100;
+
+  /** A name for the HNSW component for the info-stream */
+  public static final String HNSW_COMPONENT = "HNSW";
+
+  private final int beamWidth;
+  private final double ml;
+  private final ExplicitThreadLocal scratchNeighbors;
+
+  private final VectorSimilarityFunction similarityFunction;
+  private final VectorEncoding vectorEncoding;
+  private final RandomAccessVectorValues vectors;
+  private final ExplicitThreadLocal> graphSearcher;
+  private final ExplicitThreadLocal beamCandidates;
+
+  final ConcurrentOnHeapHnswGraph hnsw;
+  private final ConcurrentSkipListSet insertionsInProgress =
+  new ConcurrentSkipListSet<>();
+
+  private InfoStream infoStream = InfoStream.getDefault();
+
+  // we need two sources of vectors in order to perform diversity check comparisons without
+  // colliding
+  private final RandomAccessVectorValues vectorsCopy;
+
+  /** This is the "native" factory for ConcurrentHnswGraphBuilder. */
+  public static  ConcurrentHnswGraphBuilder create(
+  RandomAccessVectorValues vectors,
+  VectorEncoding vectorEncoding,
+  VectorSimilarityFunction similarityFunction,
+  int M,
+  int beamWidth)
+  throws IOException {
+return new ConcurrentHnswGraphBuilder<>(
+vectors, vectorEncoding, similarityFunction, M, beamWidth);
+  }
+
+  /**
+   * Reads all the vectors from vector values, builds a graph connecting them by their dense
+   * ordinals, using the given hyperparameter settings, and returns the resulting graph.
+   *
+   * @param vectors the vectors whose relations are represented by the graph - must provide a
+   * different view over those vectors than the one used to add via addGraphNode.
+   * @param M – graph fanout parameter used to calculate the maximum number of connections a node
+   * can have – M on upper layers, and M * 2 on the lowest level.
+   * @param beamWidth the size of the beam search to use when finding nearest neighbors.
+   */
+  public ConcurrentHnswGraphBuilder(
+  RandomAccessVectorValues vectors,
+  VectorEncoding vectorEncoding,
+  VectorSimilarityFunction similarityFunction,
+  int M,
+  int beamWidth)
+  throws IOException {
+this.vectors = vector

[GitHub] [lucene] mayya-sharipova opened a new pull request, #12436: Move max vector dims limit to Codec

2023-07-12 Thread via GitHub


mayya-sharipova opened a new pull request, #12436:
URL: https://github.com/apache/lucene/pull/12436

   Move enforcement of the maximum vector dimension limit into the default Codec's 
   KnnVectorsFormat implementation. This allows different implementations 
   of knn search algorithms to define their own limits on the maximum 
   number of vector dimensions they can handle.
   
   Closes #12309
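   
   The idea can be sketched as below, assuming each format exposes its own limit. The classes `VectorsFormatSketch`, `DefaultFormatSketch`, and `LargeFormatSketch`, the `maxDimensions()` method, and the limits 1024 and 4096 are all hypothetical and chosen for illustration; this is not the API added by this PR.

```java
/** Hypothetical stand-in for a vectors format; not Lucene's actual KnnVectorsFormat API. */
abstract class VectorsFormatSketch {

  /** Each format decides how many dimensions it is willing to index. */
  abstract int maxDimensions();

  /** The check lives with the format instead of in a single global constant. */
  final void checkDimension(String field, int dimension) {
    if (dimension <= 0 || dimension > maxDimensions()) {
      throw new IllegalArgumentException(
          "Field [" + field + "] has " + dimension
              + " dimensions; this format supports 1 to " + maxDimensions());
    }
  }
}

/** Illustrative default format with a conservative limit (value assumed). */
final class DefaultFormatSketch extends VectorsFormatSketch {
  @Override
  int maxDimensions() {
    return 1024;
  }
}

/** Illustrative alternative format that accepts larger vectors (value assumed). */
final class LargeFormatSketch extends VectorsFormatSketch {
  @Override
  int maxDimensions() {
    return 4096;
  }
}
```

   Under these assumptions a 2048-dimension field is rejected by `DefaultFormatSketch` but accepted by `LargeFormatSketch`.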
   





[GitHub] [lucene] shubhamvishu commented on pull request #12427: StringsToAutomaton#build to take List as parameter instead of Collection

2023-07-12 Thread via GitHub


shubhamvishu commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1633489589

   > This is a situation where we really cannot sort on behalf of the caller, 
so it might be a bit confusing/trappy to sort some flavors of this method but 
not others? Maybe it's best to leave these methods as they are?
   
   Agreed @gsmiller! Yes, it does seem better to leave this as is.
   
   > we could look at changing the assert on line 276 of StringsToAutomaton to 
throw an explicit IllegalArgumentException so that we don't silently build a 
corrupt automaton on unordered input (with asserts disabled). This would add 
overhead since we now have to keep track of the previous term all the time, but 
maybe it's worth benchmarking and considering this change?
   
   I like the idea of throwing an IAE if wrong input is provided. I think this would 
only affect the case where asserts are disabled? With asserts enabled, we 
already keep track of the previous term anyway.
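   
   A minimal sketch of such an explicit check, under a few assumptions: `SortedInputCheckSketch` and `consumeSorted` are hypothetical names, `String` stands in for `BytesRef`, `String#compareTo` stands in for the byte-wise term order, and duplicates are allowed. It is not the actual `StringsToAutomaton` code.

```java
/** Hypothetical illustration of validating sorted input with an explicit exception. */
final class SortedInputCheckSketch {

  /**
   * Consumes the terms in order and fails fast on unordered input, even when
   * JVM assertions are disabled. Returns the number of terms consumed.
   */
  static int consumeSorted(Iterable<String> terms) {
    String previous = null; // tracked unconditionally, which is the extra overhead
    int count = 0;
    for (String term : terms) {
      if (previous != null && previous.compareTo(term) > 0) {
        throw new IllegalArgumentException(
            "Input must be sorted: \"" + term + "\" came after \"" + previous + "\"");
      }
      previous = term;
      count++; // a real builder would add the term to the automaton here
    }
    return count;
  }
}
```

   The per-term comparison is what would need benchmarking, since an `assert` costs nothing when assertions are off while this check always runs.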
   
   > maybe we can relax StringsToAutomaton#build(Collection, boolean) 
to StringsToAutomaton#build(Iterable input, boolean). This change 
will also make it more consistent with the build(BytesRefIterator input, 
boolean asBinary) method.
   
   Good point @gautamworah96. `Iterable` seems like a better choice than 
`Collection` and would indeed be consistent with the other method. 
   





[GitHub] [lucene] shubhamvishu commented on pull request #12427: StringsToAutomaton#build to take List as parameter instead of Collection

2023-07-12 Thread via GitHub


shubhamvishu commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1633502769

   On the same note, since both methods expect an `Iterable` or an `Iterator`, 
why do we even need two separate methods here that do exactly the same 
thing, i.e. iterate over the `BytesRef`s and add them to the automaton?
   
   It would have been much better if `BytesRefIterator` implemented the 
`Iterable` and `Iterator` interfaces, in which case we could have just one method, 
`StringsToAutomaton#build(Iterable)`, which takes an `Iterable`. I don't see why 
we have a separate `BytesRefIterator` interface instead of just using 
`Iterator` (maybe because it's legacy code?). Or am I missing 
something important here?
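   
   One difference between the two shapes is that `BytesRefIterator#next()` declares `IOException` and signals the end by returning null, while `java.util.Iterator` has `hasNext()` and cannot throw checked exceptions. Whether that is the reason for the separate interface is not stated in this thread. The sketch below shows what bridging the two might look like. `ThrowingIteratorSketch` and `IteratorAdapterSketch` are hypothetical names, and the former only mimics the shape of `BytesRefIterator`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical stand-in with the same shape: next() may throw, and null means exhausted. */
interface ThrowingIteratorSketch<T> {
  T next() throws IOException;
}

/** Hypothetical adapter from the null-terminated, throwing style to java.util.Iterator. */
final class IteratorAdapterSketch {

  static <T> Iterator<T> asIterator(ThrowingIteratorSketch<T> in) {
    return new Iterator<T>() {
      private T pending; // element fetched ahead of time for hasNext()
      private boolean fetched; // true once pending reflects the latest in.next()

      private void fetch() {
        if (!fetched) {
          try {
            pending = in.next();
          } catch (IOException e) {
            // Iterator cannot declare checked exceptions, so wrap it
            throw new UncheckedIOException(e);
          }
          fetched = true;
        }
      }

      @Override
      public boolean hasNext() {
        fetch();
        return pending != null;
      }

      @Override
      public T next() {
        fetch();
        if (pending == null) {
          throw new NoSuchElementException();
        }
        T result = pending;
        pending = null;
        fetched = false;
        return result;
      }
    };
  }
}
```

   The adapter needs a one-element lookahead and has to wrap `IOException`, so a single `Iterable`-based method is possible but not free.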

