[GitHub] [lucene] uschindler commented on a diff in pull request #12436: Move max vector dims limit to Codec

via GitHub Thu, 27 Jul 2023 04:58:59 -0700


uschindler commented on code in PR #12436:
URL: https://github.com/apache/lucene/pull/12436#discussion_r1276162592



##########
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsFormat.java:
##########
@@ -76,6 +79,19 @@ public static KnnVectorsFormat forName(String name) {
   /** Returns a {@link KnnVectorsReader} to read the vectors from the index. */
   public abstract KnnVectorsReader fieldsReader(SegmentReadState state) throws 
IOException;
 
+  /**
+   * Returns the maximum number of vector dimensions supported by this codec 
for the given field
+   * name
+   *
+   * <p>Codecs should override this method to specify the maximum number of 
dimensions they support.
+   *
+   * @param fieldName the field name
+   * @return the maximum number of vector dimensions.
+   */
+  public int getMaxDimensions(String fieldName) {

Review Comment:
   In main branch this should be abstract, only in 9.x we should have a default 
impl.



##########
lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java:
##########
@@ -478,18 +473,42 @@ public void testIllegalDimensionTooLarge() throws 
Exception {
     try (Directory dir = newDirectory();
         IndexWriter w = new IndexWriter(dir, newIndexWriterConfig())) {
       Document doc = new Document();
-      expectThrows(
-          IllegalArgumentException.class,
-          () ->
-              doc.add(
-                  new KnnFloatVectorField(
-                      "f",
-                      new float[FloatVectorValues.MAX_DIMENSIONS + 1],
-                      VectorSimilarityFunction.DOT_PRODUCT)));
+      doc.add(
+          new KnnFloatVectorField(
+              "f",
+              new float[getVectorsMaxDimensions("f") + 1],
+              VectorSimilarityFunction.DOT_PRODUCT));
+      Exception exc = expectThrows(IllegalArgumentException.class, () -> 
w.addDocument(doc));

Review Comment:
   So basically the check is now delayed to `addDocument()`? Great!
   
   I was afraid that the check comes delayed while indexing of flushing is 
happening. So to me this looks good.



##########
lucene/core/src/java/org/apache/lucene/index/IndexingChain.java:
##########
@@ -621,6 +621,12 @@ private void initializeFieldInfo(PerField pf) throws 
IOException {
       final Sort indexSort = indexWriterConfig.getIndexSort();
       validateIndexSortDVType(indexSort, pf.fieldName, s.docValuesType);
     }
+    if (s.vectorDimension != 0) {
+      validateMaxVectorDimension(
+          pf.fieldName,
+          s.vectorDimension,
+          indexWriterConfig.getCodec().knnVectorsFormat().getMaxDimensions());
+    }

Review Comment:
   Yes this looks fine. The hashtable lookup is no issue at all.
   
   As getMaxDimensions() is a public codec method, we can let us do the check 
here. A separate validateFieldDims() is not needed.
   
   I only don't like the long call chain to actually get the vectors format 
from the indexWriterConfig. But I can live with that.



##########
lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldKnnVectorsFormat.java:
##########
@@ -80,6 +80,11 @@ public KnnVectorsReader fieldsReader(SegmentReadState state) 
throws IOException
     return new FieldsReader(state);
   }
 
+  @Override

Review Comment:
   is this the only subclass of KnnVectorsFormat that we have?
   
   In addition, we should also add explicit number of dimensions in backwards 
codecs, because at the time when it was implemented 1024 was their default. In 
backwards codec the method should be final, IMHO.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] uschindler commented on a diff in pull request #12436: Move max vector dims limit to Codec

Reply via email to