msfroh commented on code in PR #15990:
URL: https://github.com/apache/lucene/pull/15990#discussion_r3164482949


##########
lucene/core/src/java/org/apache/lucene/document/column/VectorTupleCursor.java:
##########
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.document.column;
+
+import org.apache.lucene.search.DocIdSetIterator;
+
+/**
+ * A tuple cursor over a {@link VectorColumn}. Yields {@code (docID, 
vectorValue)} pairs.
+ * Batch-local doc-ids are returned in strictly increasing order (vectors are 
single-valued).
+ *
+ * @param <T> the vector array type, either {@code float[]} or {@code byte[]}
+ * @lucene.experimental
+ */
+public abstract class VectorTupleCursor<T> {

Review Comment:
   Does this necessarily need to be a dedicated vector cursor?
   
   Or could this just be `ObjectTupleCursor<T>`? Just from an interface 
standpoint, `T` can be any non-primitive type.
   
   By the same token, `BinaryTupleCursor` is pretty much the same thing with 
`T` as `BytesRef`.
   
   Maybe these `TupleCursor` classes don't need to be directly connected to the 
associated `Column` classes. Instead, we could do specialized primitive 
`TupleCursors` and a single generic `ObjectTupleCursor<T>`?



##########
lucene/core/src/java/org/apache/lucene/index/IndexingChain.java:
##########
@@ -764,10 +1205,30 @@ private void initializeFieldInfo(PerField pf) throws 
IOException {
 
   /** Index each field Returns {@code true}, if we are indexing a unique field 
with postings */
   private boolean processField(int docID, IndexableField field, PerField pf) 
throws IOException {
+    boolean indexedField = invertAndStore(docID, field, pf);
+    IndexableFieldType fieldType = field.fieldType();
+    DocValuesType dvType = fieldType.docValuesType();
+    if (dvType != DocValuesType.NONE) {
+      indexDocValue(docID, pf, dvType, field);
+    }
+    if (fieldType.pointDimensionCount() != 0) {
+      pf.pointValuesWriter.addPackedValue(docID, field.binaryValue());
+    }
+    if (fieldType.vectorDimension() != 0) {
+      indexVectorValue(docID, pf, fieldType.vectorEncoding(), field);
+    }
+    return indexedField;
+  }
+
+  /**
+   * Inverts indexed fields and writes stored fields. Shared by the single-doc 
row path ({@link
+   * #processField}) and the column-batch row pass ({@link 
#processRowColumns}). Returns {@code
+   * true} if this is a unique indexed field with postings.
+   */
+  private boolean invertAndStore(int docID, IndexableField field, PerField pf) 
throws IOException {

Review Comment:
   While stored fields definitely make sense as a row-oriented thing, I'm 
wondering if there's anything clever we can do on the "invert" side if we want 
to process a bunch of values for the same field.
   
   Since the resulting data structures are field-oriented, it *feels* to me 
like there should be something we can do, but I haven't had enough coffee yet 
to have a clever idea.



##########
lucene/core/src/java/org/apache/lucene/index/IndexingChain.java:
##########
@@ -681,6 +693,435 @@ private void oversizeDocFields() {
     docFields = newDocFields;
   }
 
+  /**
+   * Process a column-oriented batch of documents. Iterates the batch's 
columns, validates each
+   * column's field type, and feeds values to the appropriate DocValuesWriter.
+   *
+   * @param baseDocID the segment-level doc ID for the first document in the 
batch (batch-local doc
+   *     0 maps to this value)
+   * @param columnBatch the column-oriented batch
+   */
+  void processBatch(int baseDocID, ColumnBatch columnBatch) throws IOException 
{
+    final int numDocs = columnBatch.numDocs();
+    boolean hasRowColumns = false;
+
+    // First pass: validate all column schemas and initialize field infos
+    for (Column column : columnBatch.columns()) {
+      final String fieldName = column.name();
+      final IndexableFieldType fieldType = column.fieldType();
+
+      ColumnValidation.validateColumnHasIndexingFeature(fieldName, fieldType);
+
+      if (column instanceof BinaryColumn bc) {
+        ColumnValidation.validateBinaryColumn(bc, fieldType);
+      } else if (column instanceof LongColumn lc) {
+        ColumnValidation.validateLongColumn(lc, fieldType);
+      } else if (column instanceof VectorColumn<?> vc) {
+        ColumnValidation.validateVectorColumn(vc, fieldType);
+      }

Review Comment:
   IMO, not worth addressing in this PR, but I wonder if this repeated 
validation will eventually take a non-trivial amount of compute, especially for 
a large schema. We can probably define a "schema" class (I'm thinking of 
Arrow's `VectorSchemaRoot`), which can be "frozen" after validation. Then each 
batch can reference the same schema and we can skip validation on subsequent 
batches.
   
   Just a thought -- for now I think the per-batch validation should be fine.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to