Re: [PR] WIP: Interface based DataFile reader and writer API [iceberg]

via GitHub Fri, 21 Feb 2025 16:11:01 -0800


rdblue commented on PR #12298:
URL: https://github.com/apache/iceberg/pull/12298#issuecomment-2675856645


   While I think the goal here is a good one, the implementation looks too 
complex to be workable in its current form.
   
   The primary issue that we currently have is adapting object models (like 
Iceber's internal `StructLike`, Spark's `InternalRow`, or Flink's `RowData`) to 
file formats so that you can separately write object model to format glue code 
and have it work throughout support for an engine. I think a diff from the 
`InternalData` PR demonstrates it pretty well:
   
   ```diff
   -    switch (format) {
   -      case AVRO:
   -        AvroIterable<ManifestEntry<F>> reader =
   -            Avro.read(file)
   -                
.project(ManifestEntry.wrapFileSchema(Types.StructType.of(fields)))
   -                .createResolvingReader(this::newReader)
   -                .reuseContainers()
   -                .build();
   +    CloseableIterable<ManifestEntry<F>> reader =
   +        InternalData.read(format, file)
   +            
.project(ManifestEntry.wrapFileSchema(Types.StructType.of(fields)))
   +            .reuseContainers()
   +            .build();
    
   -        addCloseable(reader);
   +    addCloseable(reader);
    
   -        return CloseableIterable.transform(reader, 
inheritableMetadata::apply);
   +    return CloseableIterable.transform(reader, inheritableMetadata::apply);
   -
   -      default:
   -        throw new UnsupportedOperationException("Invalid format for 
manifest file: " + format);
   -    }
   ```
   
   This shows:
   * Rather than a switch, the format is passed to create the builder
   * There is no longer a callback passed to create readers for the object 
model (`createResolvingReader`)
   
   In this PR, there are a lot of other changes as well. I'm looking at one of 
the simpler Spark cases in the [row 
reader](https://github.com/apache/iceberg/pull/12298/files#diff-18397004b7d30d1611d6c9c49dadc429a6495db4cdafea1951cc181f5af01226).
   
   The builder is initialized from `DataFileServiceRegistry` and now requires a 
format, class name, file, projection, and constant map:
   ```java
       return DataFileServiceRegistry.readerBuilder(
               format, InternalRow.class.getName(), file, projection, 
idToConstant)
   ```
   
   There are also new static classes in the file. Each creates a new service 
and each service creates the builder and object model:
   
   ```java
     public static class AvroReaderService implements 
DataFileServiceRegistry.ReaderService {
       @Override
       public DataFileServiceRegistry.Key key() {
         return new DataFileServiceRegistry.Key(FileFormat.AVRO, 
InternalRow.class.getName());
       }
   
       @Override
       public ReaderBuilder builder(
           InputFile inputFile,
           Schema readSchema,
           Map<Integer, ?> idToConstant,
           DeleteFilter<?> deleteFilter) {
         return Avro.read(inputFile)
             .project(readSchema)
             .createResolvingReader(schema -> 
SparkPlannedAvroReader.create(schema, idToConstant));
       }
   ```
   
   The `createResolvingReader` line is still there, just moved into its own 
service class instead of in branches of a switch statement.
   
   In addition, there are now a lot more abstractions:
   * A builder for creating an appender for a file format
   * A builder for creating a data file writer for a file format
   * A builder for creating an equality delete writer for a file format
   * A builder for creating a position delete writer for a file format
   * A builder for creating a reader for a file format
   * A "service" registry (what is a service?)
   * A "key"
   * A writer service
   * A reader service
   
   I think that the next steps are to focus on making this a lot simpler, and 
there are some good ways to do that:
   
   * Focus on removing boilerplate and hiding the internals. For instance, 
`Key`, if needed, should be an internal abstraction and not complexity that is 
exposed to callers
   * The format-specific data and delete file builders typically wrap an 
appender builder. Is there a way to handle just the reader builder and appender 
builder?
   * Is the extra "service" abstraction helpful?
   * Remove `ServiceLoader` and use a simpler solution. I think that formats 
could simply register themselves like we do for `InternalData`. I think it 
would be fine to have a trade-off that Iceberg ships with a list of known 
formats that can be loaded, and if you want to replace that list it's at your 
own risk.
   * Standardize more across the builders for `FileFormat`. How `idToConstant` 
is handled is a good example. That should be passed to the builder instead of 
making the whole API more complicated. Projection is the same.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] WIP: Interface based DataFile reader and writer API [iceberg]

Reply via email to