pvary commented on PR #12298: URL: https://github.com/apache/iceberg/pull/12298#issuecomment-2678012816
> While I think the goal here is a good one, the implementation looks too complex to be workable in its current form. I'm happy that we agree with the goals. I created a PR to start the conversation. If there are willing reviewers we can introduce more invasive changes to archive a better API. I'm all for it! > The primary issue that we currently have is adapting object models (like Iceber's internal `StructLike`, Spark's `InternalRow`, or Flink's `RowData`) to file formats so that you can separately write object model to format glue code and have it work throughout support for an engine. I think we need to keep this direct transformations to prevent the performance loss which would be caused by multiple transformations between object model -> common model -> file format. We have a matrix of transformation which we need to encode somewhere: | Source | Target | | ------------- | ------------- | | Parquet | StructLike | | Parquet | InternalRow | | Parquet | RowData | | Parquet | Arrow | | Avro | ... | | ORC | ... | > [..] > * Rather than a switch, the format is passed to create the builder > * There is no longer a callback passed to create readers for the object model (`createResolvingReader`) The InternalData reader has one advantage over the data file readers/writers. The internal object model is static for these readers/writers. For the DataFile readers/writers we have multiple object models to handle. > [..] > I think that the next steps are to focus on making this a lot simpler, and there are some good ways to do that: > > * Focus on removing boilerplate and hiding the internals. For instance, `Key`, if needed, should be an internal abstraction and not complexity that is exposed to callers If we allow adding new builders for the file formats we can remove a good chunk of the boilerplate code. Let me see how this would look like > * The format-specific data and delete file builders typically wrap an appender builder. Is there a way to handle just the reader builder and appender builder? We need to refactor the Avro positional delete write for this, or add a positionalWriterFunc. Also need to consider that the format specific configurations which are different for the appenders and the delete files (DELETE_PARQUET_ROW_GROUP_SIZE_BYTES vs. PARQUET_ROW_GROUP_SIZE_BYTES) > * Is the extra "service" abstraction helpful? If we are ok with having a new Builder for the readers/writers, then we don't need the service. It was needed to keep the current APIs and the new APIs compatible. > * Remove `ServiceLoader` and use a simpler solution. I think that formats could simply register themselves like we do for `InternalData`. I think it would be fine to have a trade-off that Iceberg ships with a list of known formats that can be loaded, and if you want to replace that list it's at your own risk. Will do > * Standardize more across the builders for `FileFormat`. How `idToConstant` is handled is a good example. That should be passed to the builder instead of making the whole API more complicated. Projection is the same. Will see what could be arcived -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org