[GitHub] [lucene] benwtrent commented on a diff in pull request #11946: add similarity threshold for hnsw

GitBox Tue, 06 Dec 2022 06:02:56 -0800


benwtrent commented on code in PR #11946:
URL: https://github.com/apache/lucene/pull/11946#discussion_r1040969450



##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
##########
@@ -37,6 +37,7 @@
  * @param <T> the type of query vector
  */
 public class HnswGraphSearcher<T> {
+  private final int UNBOUNDED_QUEUE_INIT_SIZE = 10_000;

Review Comment:
   Any research to indicate why this number was chosen? It seems silly that if 
a user provides `k = 10_001` it would have a queue bigger than `k = 
Integer.MAX_VALUE`.
   
   Technically, the max value here should be something like 
`ArrayUtil.MAX_ARRAY_LENGTH` But this eagerly allocates a `new 
long[heapSize];`. This is VERY costly.
   
   I would prefer a number with some significant reason behind it or some 
better way of queueing neighbors.



##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
##########
@@ -235,7 +312,7 @@ private NeighborQueue searchLevel(
     while (candidates.size() > 0 && results.incomplete() == false) {
       // get the best candidate (closest or best scoring)
       float topCandidateSimilarity = candidates.topScore();
-      if (topCandidateSimilarity < minAcceptedSimilarity) {
+      if (topCandidateSimilarity < minAcceptedSimilarity && results.size() >= 
topK) {
         break;
       }

Review Comment:
   I am not sure about this. This stops gathering results once its filled. This 
defeats the purpose of exploring the graph.
   
   Have you seen how this effects recall?



##########
lucene/core/src/java/org/apache/lucene/index/LeafReader.java:
##########
@@ -232,8 +232,48 @@ public final PostingsEnum postings(Term term) throws 
IOException {
    * @return the k nearest neighbor documents, along with their 
(searchStrategy-specific) scores.
    * @lucene.experimental
    */
+  public final TopDocs searchNearestVectors(
+      String field, float[] target, int k, Bits acceptDocs, int visitedLimit) 
throws IOException {
+    return searchNearestVectors(
+        field, target, k, Float.NEGATIVE_INFINITY, acceptDocs, visitedLimit);
+  }
+
+  /**
+   * Return the k nearest neighbor documents as determined by comparison of 
their vector values for
+   * this field, to the given vector, by the field's similarity function. The 
score of each document
+   * is derived from the vector similarity in a way that ensures scores are 
positive and that a
+   * larger score corresponds to a higher ranking.
+   *
+   * <p>The search is allowed to be approximate, meaning the results are not 
guaranteed to be the
+   * true k closest neighbors. For large values of k (for example when k is 
close to the total
+   * number of documents), the search may also retrieve fewer than k documents.
+   *
+   * <p>The returned {@link TopDocs} will contain a {@link ScoreDoc} for each 
nearest neighbor,
+   * sorted in order of their similarity to the query vector (decreasing 
scores). The {@link
+   * TotalHits} contains the number of documents visited during the search. If 
the search stopped
+   * early because it hit {@code visitedLimit}, it is indicated through the 
relation {@code
+   * TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO}.
+   *
+   * @param field the vector field to search
+   * @param target the vector-valued query
+   * @param k the number of docs to return (the upper bound)
+   * @param similarityThreshold the minimum acceptable value of similarity

Review Comment:
   Would it be possible for this threshold to be an actual distance? My concern 
here is that for things like `byteVectors`, dot-product scores are insanely 
small (I think this is a design flaw in itself) and may be confusing to users 
who want a given "radius" but instead have to figure out a score related to 
their radius. 
   
   It would be prudent that IF we provided some filtering on a threshold within 
the search, that this threshold reflects vector distance directly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] benwtrent commented on a diff in pull request #11946: add similarity threshold for hnsw

Reply via email to