pvary commented on code in PR #13786:
URL: https://github.com/apache/iceberg/pull/13786#discussion_r2504608108
##########
parquet/src/main/java/org/apache/iceberg/parquet/VectorizedParquetReaderFactory.java:
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.parquet;
+
+import java.nio.ByteBuffer;
+import java.util.Map;
+import java.util.function.Function;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.io.InputFile;
+import org.apache.iceberg.mapping.NameMapping;
+import org.apache.parquet.ParquetReadOptions;
+import org.apache.parquet.schema.MessageType;
+
+/**
+ * Service Provider Interface (SPI) for creating custom vectorized Parquet readers.
+ *
+ * <p>Implementations of this interface can be loaded at runtime using Java's {@link
+ * java.util.ServiceLoader} mechanism. To register an implementation, create a file named {@code
+ * META-INF/services/org.apache.iceberg.parquet.VectorizedParquetReaderFactory} containing the
+ * fully qualified class name of the implementation.
+ *
+ * <p>This allows for pluggable vectorized reader implementations (e.g., Comet, Arrow, Velox)
+ * without requiring the core parquet module to depend on specific execution engines.
+ */
+public interface VectorizedParquetReaderFactory {
+
+  /**
+   * Returns the unique identifier for this reader factory.
+   *
+   * <p>This name is used to select the reader factory via configuration. For example, "comet" for
+   * the Comet vectorized reader.
+   *
+   * @return the unique name for this factory
+   */
+  String name();
+
+  /**
+   * Creates a vectorized parquet reader with the given configuration.
+   *
+   * @param file the input file to read
+   * @param schema the expected schema for the data
+   * @param options parquet read options
+   * @param batchedReaderFunc function to create a VectorizedReader from a MessageType
+   * @param mapping name mapping for schema evolution
+   * @param filter filter expression to apply during reading
+   * @param reuseContainers whether to reuse containers for records
+   * @param caseSensitive whether column name matching should be case-sensitive
+   * @param maxRecordsPerBatch maximum number of records per batch
+   * @param properties additional properties for reader configuration
+   * @param start optional start position for reading
+   * @param length optional length to read
+   * @param fileEncryptionKey optional encryption key for the file
+   * @param fileAADPrefix optional AAD prefix for encryption
+   * @param <T> the type of records returned by the reader
+   * @return a closeable iterable of records
+   */
+  <T> CloseableIterable<T> createReader(

Review Comment:

This is ugly. For this many parameters we usually create a builder.
Like the WriteBuilder in this PR: https://github.com/apache/iceberg/pull/12774/files

I have been toying with the idea of creating a separate FormatModel for Comet and registering it in the FormatModelRegistry. The rationale was that we would basically be creating an alternate reader for Parquet vectorized reads. We could still do it if the returned batch class is different. Like:

```
public class SparkFormatModels {
  FormatModelRegistry.register(
      new ParquetFormatModel<ColumnarBatch, StructType, DeleteFilter<InternalRow>>(
          ColumnarBatch.class, StructType.class, VectorizedSparkParquetReaders::buildReader));

  FormatModelRegistry.register(
      new ParquetFormatModel<ColumnarBatch, StructType, DeleteFilter<InternalRow>>(
          CometColumnarBatch.class, StructType.class, VectorizedSparkParquetReaders::buildCometReader));
}
```

And if they are compatible, then we can rely on them when getting the builder:

```
abstract class BaseBatchReader<T extends ScanTask> extends BaseReader<ColumnarBatch, T> {
  [..]
  ReadBuilder readBuilder =
      FormatModelRegistry.readBuilder(
          format,
          config.isComet() ? CometColumnarBatch.class : ColumnarBatch.class,
          inputFile);
  [..]
}
```
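For illustration, a builder-style replacement for the long `createReader(...)` parameter list could look roughly like the sketch below. This is hypothetical: the `ReadBuilder` interface and its method names are invented here from the `@param` list in the diff above, and are not the actual WriteBuilder API from #12774.

```
// Hypothetical sketch; assumes the same imports as the file in the diff above.
// VectorizedReader is the existing org.apache.iceberg.parquet.VectorizedReader type.
public interface VectorizedParquetReaderFactory {
  String name();

  // Entry point replacing the flat createReader(...) signature
  <T> ReadBuilder<T> readBuilder(InputFile file);

  interface ReadBuilder<T> {
    ReadBuilder<T> project(Schema schema);
    ReadBuilder<T> options(ParquetReadOptions options);
    ReadBuilder<T> readerFunction(Function<MessageType, VectorizedReader<?>> batchedReaderFunc);
    ReadBuilder<T> withNameMapping(NameMapping mapping);
    ReadBuilder<T> filter(Expression filter);
    ReadBuilder<T> reuseContainers(boolean reuse);
    ReadBuilder<T> caseSensitive(boolean caseSensitive);
    ReadBuilder<T> recordsPerBatch(int maxRecordsPerBatch);
    ReadBuilder<T> properties(Map<String, String> properties);
    ReadBuilder<T> split(long start, long length);
    ReadBuilder<T> withFileEncryptionKey(ByteBuffer encryptionKey);
    ReadBuilder<T> withAADPrefix(ByteBuffer aadPrefix);
    CloseableIterable<T> build();
  }
}
```

A caller would then chain only the options it needs, e.g. `factory.readBuilder(file).project(schema).recordsPerBatch(5000).build()`, instead of passing fourteen positional arguments.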

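For completeness, the ServiceLoader-based discovery described in the interface's Javadoc would be wired up roughly as follows. Only the services file name and the `ServiceLoader` call follow from the Javadoc; the `load(String)` helper and the example implementation class name are hypothetical.

```
// META-INF/services/org.apache.iceberg.parquet.VectorizedParquetReaderFactory
// contains one line with the implementation's fully qualified class name,
// e.g. (hypothetical): org.example.comet.CometReaderFactory

import java.util.ServiceLoader;

class VectorizedParquetReaderFactories {
  // Finds the registered factory whose name() matches the configured value
  static VectorizedParquetReaderFactory load(String name) {
    for (VectorizedParquetReaderFactory factory :
        ServiceLoader.load(VectorizedParquetReaderFactory.class)) {
      if (factory.name().equalsIgnoreCase(name)) {
        return factory;
      }
    }

    throw new IllegalArgumentException("No vectorized reader factory named: " + name);
  }
}
```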