Re: [PR] WIP: Interface based DataFile reader and writer API [iceberg]

via GitHub Mon, 24 Feb 2025 02:35:56 -0800


pvary commented on PR #12298:
URL: https://github.com/apache/iceberg/pull/12298#issuecomment-2678012816


   > While I think the goal here is a good one, the implementation looks too 
complex to be workable in its current form.
   
   I'm happy that we agree with the goals. I created a PR to start the 
conversation. If there are willing reviewers we can introduce more invasive 
changes to archive a better API. I'm all for it!
   
   > The primary issue that we currently have is adapting object models (like 
Iceber's internal `StructLike`, Spark's `InternalRow`, or Flink's `RowData`) to 
file formats so that you can separately write object model to format glue code 
and have it work throughout support for an engine.
   
   I think we need to keep this direct transformations to prevent the 
performance loss which would be caused by multiple transformations between 
object model -> common model -> file format.
   
   We have a matrix of transformation which we need to encode somewhere:
   |  Source | Target |
   | ------------- | ------------- |
   | Parquet  | StructLike  |
   | Parquet  | InternalRow  |
   | Parquet  | RowData  |
   | Parquet  | Arrow  |
   | Avro  | ... |
   | ORC  | ... |
   
   > [..]
   > * Rather than a switch, the format is passed to create the builder
   > * There is no longer a callback passed to create readers for the object 
model (`createResolvingReader`)
   
   The InternalData reader has one advantage over the data file 
readers/writers. The internal object model is static for these readers/writers. 
For the DataFile readers/writers we have multiple object models to handle.
   
   > [..]
   > I think that the next steps are to focus on making this a lot simpler, and 
there are some good ways to do that:
   > 
   > * Focus on removing boilerplate and hiding the internals. For instance, 
`Key`, if needed, should be an internal abstraction and not complexity that is 
exposed to callers
   
   If we allow adding new builders for the file formats we can remove a good 
chunk of the boilerplate code. Let me see how this would look like
   
   > * The format-specific data and delete file builders typically wrap an 
appender builder. Is there a way to handle just the reader builder and appender 
builder?
   
   We need to refactor the Avro positional delete write for this, or add a 
positionalWriterFunc. Also need to consider that the format specific 
configurations which are different for the appenders and the delete files 
(DELETE_PARQUET_ROW_GROUP_SIZE_BYTES vs. PARQUET_ROW_GROUP_SIZE_BYTES)
   
   > * Is the extra "service" abstraction helpful?
   
   If we are ok with having a new Builder for the readers/writers, then we 
don't need the service. It was needed to keep the current APIs and the new APIs 
compatible.
   
   > * Remove `ServiceLoader` and use a simpler solution. I think that formats 
could simply register themselves like we do for `InternalData`. I think it 
would be fine to have a trade-off that Iceberg ships with a list of known 
formats that can be loaded, and if you want to replace that list it's at your 
own risk.
   
   Will do
   
   > * Standardize more across the builders for `FileFormat`. How 
`idToConstant` is handled is a good example. That should be passed to the 
builder instead of making the whole API more complicated. Projection is the 
same.
   
   Will see what could be arcived
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] WIP: Interface based DataFile reader and writer API [iceberg]

Reply via email to