[GitHub] [lucene-solr] msokolov commented on a change in pull request #1930: LUCENE-9322: add VectorValues to new Lucene90 codec

GitBox Fri, 16 Oct 2020 12:23:20 -0700


msokolov commented on a change in pull request #1930:
URL: https://github.com/apache/lucene-solr/pull/1930#discussion_r506677596




##########
File path: lucene/core/src/java/org/apache/lucene/index/VectorValues.java
##########
@@ -0,0 +1,273 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.index;
+
+import java.io.IOException;
+
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.util.BytesRef;
+
+/**
+ * This class provides access to per-document floating point vector values indexed as {@link
+ * org.apache.lucene.document.VectorField}.
+ */
+public abstract class VectorValues extends DocIdSetIterator {
+
+  /** The maximum length of a vector */
+  public static int MAX_DIMENSIONS = 1024;
+
+  /** Sole constructor */
+  protected VectorValues() {}
+
+  /**
+   * Return the dimension of the vectors
+   */
+  public abstract int dimension();
+
+  /**
+   * TODO: should we use cost() for this? We rely on its always being exactly the number
+   * of documents having a value for this field, which is not guaranteed by the cost() contract,
+   * but in all the implementations so far they are the same.
+   * @return the number of vectors returned by this iterator
+   */
+  public abstract int size();
+
+  /**
+   * Return the score function used to compare these vectors
+   */
+  public abstract ScoreFunction scoreFunction();
+
+  /**
+   * Return the vector value for the current document ID.
+   * It is illegal to call this method when the iterator is not positioned: before advancing, or after failing to advance.
+   * The returned array may be shared across calls, re-used, and modified as the iterator advances.
+   * @return the vector value
+   */
+  public abstract float[] vectorValue() throws IOException;

Review comment:
       If you are iterating, you can keep track of the ordinal by counting. On
the other hand, you wouldn't need to, since the only purpose of the ordinal is
to retrieve the vector, which you already have. So I think getting the ordinal
for the *current* docId is not really all that helpful.
   
   In theory one might want to get an ordinal for some arbitrary docId. I guess 
the API is sort of incomplete without it - it offers random access, but only by 
an opaque ordinal. You can of course iterate over all the docs and build your 
own map, but that is kind of unhelpful.
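   
   For illustration only, building that map by hand might look like the sketch 
below. It assumes ordinals are assigned in iteration order starting at 0, and 
it uses a plain `PrimitiveIterator.OfInt` as a stand-in for the doc iterator; 
`DocToOrdMap` is a hypothetical helper, not part of this PR.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PrimitiveIterator;

/** Hypothetical helper: builds a docId -> ordinal map by counting during iteration. */
final class DocToOrdMap {

  /**
   * Assumes the iterator yields the docIds that have a vector, in increasing order,
   * and that the n-th docId returned has ordinal n (starting at 0).
   */
  static Map<Integer, Integer> build(PrimitiveIterator.OfInt docIds) {
    Map<Integer, Integer> docToOrd = new HashMap<>();
    int ord = 0;
    while (docIds.hasNext()) {
      docToOrd.put(docIds.nextInt(), ord++);
    }
    return docToOrd;
  }
}
```

   Every caller pays a full pass over the field plus the memory for the map, 
which is the unhelpful part.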
   
   However, supporting this comes at some additional cost. It's not required 
to support knnSearch, since that works by searching for the best ordinal and 
then mapping it to a docId; internally we do not maintain any docId->ordinal 
mapping. We can get an answer using binary search in the ordToDoc map, but I 
wonder if we should expose that. WDYT?
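   
   Roughly, the binary-search lookup could be sketched as follows. This is 
not the code in the PR: it assumes ordToDoc is a plain `int[]` whose entries 
are strictly increasing docIds (ordinal i maps to `ordToDoc[i]`), and 
`OrdLookup.docToOrd` is a hypothetical name.

```java
import java.util.Arrays;

/** Hypothetical helper: reverse lookup over an ordinal -> docId array. */
final class OrdLookup {

  /**
   * Returns the ordinal whose docId equals the given docId, or -1 if that
   * document has no vector. Assumes ordToDoc is sorted in increasing order
   * with no duplicates, so binary search is valid.
   */
  static int docToOrd(int[] ordToDoc, int docId) {
    int ord = Arrays.binarySearch(ordToDoc, docId);
    // binarySearch returns a negative value when the key is absent
    return ord >= 0 ? ord : -1;
  }
}
```

   Each lookup is O(log n) in the number of vectors and needs no extra 
storage, which is the trade-off against keeping a separate docId->ordinal 
structure.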




