pvary commented on code in PR #14040:
URL: https://github.com/apache/iceberg/pull/14040#discussion_r2359333485
##########
parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java:
##########
@@ -1281,12 +1304,24 @@ public ReadBuilder withNameMapping(NameMapping
newNameMapping) {
@Override
public ReadBuilder setRootType(Class<? extends StructLike> rootClass) {
- throw new UnsupportedOperationException("Custom types are not yet
supported");
+ Preconditions.checkArgument(
+ this.internalReader != null, "Cannot set root type without using an
Internal Reader");
+ Preconditions.checkArgument(
+ this.readerFunc == null && this.readerFuncWithSchema == null,
+ "Setting root type is not compatible with setting a reader
function");
+ internalReader.setRootType(rootClass);
Review Comment:
We directly use the Iceberg `schema` in the readers for column pruning and
filter pushdown, so we should store that in both the ReadBuilder, and in the
ParquetReaderFunction. Other than that, it could work.
I think we should deprecate and remove the old setters for
readerFunctions/writerFunctions which could help removing the clutter around
this.
We have to decide if we really want to formalize the
ReaderFunction/WriterFunction API.
Currently I have this in the FileFormat API proposal:
```
new ParquetFormatModel<InternalRow, StructType,
DeleteFilter<InternalRow>>(
InternalRow.class,
(schema, messageType, constantFieldAccessors) ->
SparkParquetReaders.buildReader(...),
(icebergSchema, messageType, inputType) ->
SparkParquetWriters.buildWriter(
SparkSchemaUtil.convert(icebergSchema), messageType))
new ParquetFormatModel<ColumnarBatch, StructType,
DeleteFilter<InternalRow>>(
ColumnarBatch.class,
(schema, messageType, constantFieldAccessors, deleteFilter,
config) ->
VectorizedSparkParquetReaders.buildReader(...))
```
We can formalize the ReaderFunction/WriterFunction, but I have the following
concerns:
- This will be inherently FileFormat specific (See: messageType,
deleteFilter, config)
- Always prone to adding more fields - I don't see why we would not share
any of the builder parameters
I still prefer to have the readerFunctions/writerFunctions, as this makes
clear which of the parameters are really used there, but can be convinced
otherwise.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]