rdblue commented on PR #12298: URL: https://github.com/apache/iceberg/pull/12298#issuecomment-2675856645
While I think the goal here is a good one, the implementation looks too complex to be workable in its current form. The primary issue that we currently have is adapting object models (like Iceber's internal `StructLike`, Spark's `InternalRow`, or Flink's `RowData`) to file formats so that you can separately write object model to format glue code and have it work throughout support for an engine. I think a diff from the `InternalData` PR demonstrates it pretty well: ```diff - switch (format) { - case AVRO: - AvroIterable<ManifestEntry<F>> reader = - Avro.read(file) - .project(ManifestEntry.wrapFileSchema(Types.StructType.of(fields))) - .createResolvingReader(this::newReader) - .reuseContainers() - .build(); + CloseableIterable<ManifestEntry<F>> reader = + InternalData.read(format, file) + .project(ManifestEntry.wrapFileSchema(Types.StructType.of(fields))) + .reuseContainers() + .build(); - addCloseable(reader); + addCloseable(reader); - return CloseableIterable.transform(reader, inheritableMetadata::apply); + return CloseableIterable.transform(reader, inheritableMetadata::apply); - - default: - throw new UnsupportedOperationException("Invalid format for manifest file: " + format); - } ``` This shows: * Rather than a switch, the format is passed to create the builder * There is no longer a callback passed to create readers for the object model (`createResolvingReader`) In this PR, there are a lot of other changes as well. I'm looking at one of the simpler Spark cases in the [row reader](https://github.com/apache/iceberg/pull/12298/files#diff-18397004b7d30d1611d6c9c49dadc429a6495db4cdafea1951cc181f5af01226). The builder is initialized from `DataFileServiceRegistry` and now requires a format, class name, file, projection, and constant map: ```java return DataFileServiceRegistry.readerBuilder( format, InternalRow.class.getName(), file, projection, idToConstant) ``` There are also new static classes in the file. Each creates a new service and each service creates the builder and object model: ```java public static class AvroReaderService implements DataFileServiceRegistry.ReaderService { @Override public DataFileServiceRegistry.Key key() { return new DataFileServiceRegistry.Key(FileFormat.AVRO, InternalRow.class.getName()); } @Override public ReaderBuilder builder( InputFile inputFile, Schema readSchema, Map<Integer, ?> idToConstant, DeleteFilter<?> deleteFilter) { return Avro.read(inputFile) .project(readSchema) .createResolvingReader(schema -> SparkPlannedAvroReader.create(schema, idToConstant)); } ``` The `createResolvingReader` line is still there, just moved into its own service class instead of in branches of a switch statement. In addition, there are now a lot more abstractions: * A builder for creating an appender for a file format * A builder for creating a data file writer for a file format * A builder for creating an equality delete writer for a file format * A builder for creating a position delete writer for a file format * A builder for creating a reader for a file format * A "service" registry (what is a service?) * A "key" * A writer service * A reader service I think that the next steps are to focus on making this a lot simpler, and there are some good ways to do that: * Focus on removing boilerplate and hiding the internals. For instance, `Key`, if needed, should be an internal abstraction and not complexity that is exposed to callers * The format-specific data and delete file builders typically wrap an appender builder. Is there a way to handle just the reader builder and appender builder? * Is the extra "service" abstraction helpful? * Remove `ServiceLoader` and use a simpler solution. I think that formats could simply register themselves like we do for `InternalData`. I think it would be fine to have a trade-off that Iceberg ships with a list of known formats that can be loaded, and if you want to replace that list it's at your own risk. * Standardize more across the builders for `FileFormat`. How `idToConstant` is handled is a good example. That should be passed to the builder instead of making the whole API more complicated. Projection is the same. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org