rdblue commented on PR #12298:
URL: https://github.com/apache/iceberg/pull/12298#issuecomment-2675856645
While I think the goal here is a good one, the implementation looks too
complex to be workable in its current form.
The primary issue that we currently have is adapting object models (like
Iceber's internal `StructLike`, Spark's `InternalRow`, or Flink's `RowData`) to
file formats so that you can separately write object model to format glue code
and have it work throughout support for an engine. I think a diff from the
`InternalData` PR demonstrates it pretty well:
```diff
- switch (format) {
- case AVRO:
- AvroIterable<ManifestEntry<F>> reader =
- Avro.read(file)
-
.project(ManifestEntry.wrapFileSchema(Types.StructType.of(fields)))
- .createResolvingReader(this::newReader)
- .reuseContainers()
- .build();
+ CloseableIterable<ManifestEntry<F>> reader =
+ InternalData.read(format, file)
+
.project(ManifestEntry.wrapFileSchema(Types.StructType.of(fields)))
+ .reuseContainers()
+ .build();
- addCloseable(reader);
+ addCloseable(reader);
- return CloseableIterable.transform(reader,
inheritableMetadata::apply);
+ return CloseableIterable.transform(reader, inheritableMetadata::apply);
-
- default:
- throw new UnsupportedOperationException("Invalid format for
manifest file: " + format);
- }
```
This shows:
* Rather than a switch, the format is passed to create the builder
* There is no longer a callback passed to create readers for the object
model (`createResolvingReader`)
In this PR, there are a lot of other changes as well. I'm looking at one of
the simpler Spark cases in the [row
reader](https://github.com/apache/iceberg/pull/12298/files#diff-18397004b7d30d1611d6c9c49dadc429a6495db4cdafea1951cc181f5af01226).
The builder is initialized from `DataFileServiceRegistry` and now requires a
format, class name, file, projection, and constant map:
```java
return DataFileServiceRegistry.readerBuilder(
format, InternalRow.class.getName(), file, projection,
idToConstant)
```
There are also new static classes in the file. Each creates a new service
and each service creates the builder and object model:
```java
public static class AvroReaderService implements
DataFileServiceRegistry.ReaderService {
@Override
public DataFileServiceRegistry.Key key() {
return new DataFileServiceRegistry.Key(FileFormat.AVRO,
InternalRow.class.getName());
}
@Override
public ReaderBuilder builder(
InputFile inputFile,
Schema readSchema,
Map<Integer, ?> idToConstant,
DeleteFilter<?> deleteFilter) {
return Avro.read(inputFile)
.project(readSchema)
.createResolvingReader(schema ->
SparkPlannedAvroReader.create(schema, idToConstant));
}
```
The `createResolvingReader` line is still there, just moved into its own
service class instead of in branches of a switch statement.
In addition, there are now a lot more abstractions:
* A builder for creating an appender for a file format
* A builder for creating a data file writer for a file format
* A builder for creating an equality delete writer for a file format
* A builder for creating a position delete writer for a file format
* A builder for creating a reader for a file format
* A "service" registry (what is a service?)
* A "key"
* A writer service
* A reader service
I think that the next steps are to focus on making this a lot simpler, and
there are some good ways to do that:
* Focus on removing boilerplate and hiding the internals. For instance,
`Key`, if needed, should be an internal abstraction and not complexity that is
exposed to callers
* The format-specific data and delete file builders typically wrap an
appender builder. Is there a way to handle just the reader builder and appender
builder?
* Is the extra "service" abstraction helpful?
* Remove `ServiceLoader` and use a simpler solution. I think that formats
could simply register themselves like we do for `InternalData`. I think it
would be fine to have a trade-off that Iceberg ships with a list of known
formats that can be loaded, and if you want to replace that list it's at your
own risk.
* Standardize more across the builders for `FileFormat`. How `idToConstant`
is handled is a good example. That should be passed to the builder instead of
making the whole API more complicated. Projection is the same.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]