Re: [PR] Parquet, Core: Allows Internal Parquet Readers to use Custom Types [iceberg]

via GitHub Thu, 18 Sep 2025 06:53:14 -0700


pvary commented on code in PR #14040:
URL: https://github.com/apache/iceberg/pull/14040#discussion_r2359333485



##########
parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java:
##########
@@ -1281,12 +1304,24 @@ public ReadBuilder withNameMapping(NameMapping 
newNameMapping) {
 
     @Override
     public ReadBuilder setRootType(Class<? extends StructLike> rootClass) {
-      throw new UnsupportedOperationException("Custom types are not yet 
supported");
+      Preconditions.checkArgument(
+          this.internalReader != null, "Cannot set root type without using an 
Internal Reader");
+      Preconditions.checkArgument(
+          this.readerFunc == null && this.readerFuncWithSchema == null,
+          "Setting root type is not compatible with setting a reader 
function");
+      internalReader.setRootType(rootClass);

Review Comment:
   We directly use the Iceberg `schema` in the readers for column pruning and 
filter pushdown, so we should store that in both the ReadBuilder, and in the 
ParquetReaderFunction. Other than that, it could work.
   
   I think we should deprecate and remove the old setters for 
readerFunctions/writerFunctions which could help removing the clutter around 
this.
   
   We have to decide if we really want to formalize the 
ReaderFunction/WriterFunction API.
   Currently I have this in the FileFormat API proposal:
   ```
           new ParquetFormatModel<InternalRow, StructType, 
DeleteFilter<InternalRow>>(
               InternalRow.class,
               (schema, messageType, constantFieldAccessors) ->
                   SparkParquetReaders.buildReader(...),
               (icebergSchema, messageType, inputType) ->
                   SparkParquetWriters.buildWriter(
                       SparkSchemaUtil.convert(icebergSchema), messageType))
   
           new ParquetFormatModel<ColumnarBatch, StructType, 
DeleteFilter<InternalRow>>(
               ColumnarBatch.class,
               (schema, messageType, constantFieldAccessors, deleteFilter, 
config) ->
                   VectorizedSparkParquetReaders.buildReader(...))
   ```
   
   We can formalize the ReaderFunction/WriterFunction, but I have the following 
concerns:
   - This will be inherently FileFormat specific (See: messageType, 
deleteFilter, config)
   - Always prone to adding more fields - I don't see why we would not share 
any of the builder parameters
   
   I still prefer to have the readerFunctions/writerFunctions, as this makes 
clear which of the parameters are really used there, but can be convinced 
otherwise.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Parquet, Core: Allows Internal Parquet Readers to use Custom Types [iceberg]

Reply via email to