Re: [PR] Core: Interface based DataFile reader and writer API [iceberg]

via GitHub Mon, 23 Jun 2025 15:12:54 -0700


stevenzwu commented on PR #12298:
URL: https://github.com/apache/iceberg/pull/12298#issuecomment-2998113103


   Right now, there are two questions that Peter and I are discussing offline.
   
   1) is it necessary to have a globally shared `FileAccessFactoryRegistry` in 
the iceberg-data module and have engine to register the factories class (e.g. 
`FlinkObjectModels`)?
   
   The usage pattern in the engine (like Flink) is not much different in both 
models (with or without registry).
   
   here is the usage pattern for with registry model.
   ```
         WriteBuilder<?, RowType, RowData> builder =
             FileAccessFactoryRegistry.writeBuilder(
                 format,
                 FlinkObjectModels.FLINK_OBJECT_MODEL,
                 EncryptedFiles.plainAsEncryptedOutput(outputFile));
   ```
   
   Here is the usage pattern for without registry model (from Peter's testing 
PR #13257).
   ```
         WriteBuilder<?, RowType, RowData> builder =
             FlinkFileAccessor.INSTANCE.writeBuilder(
                 format, EncryptedFiles.plainAsEncryptedOutput(outputFile));
   ```
   
   From my perspective, I don't see much value for sharing the Flink and Spark 
factory registration globally. Each engine knows the factory it should use. 
   
   2) Peter was thinking about consolidating engine integrations in the file 
format module. 
   
   Let's say we want to add support for `Vortex` file format. Here are the 
steps needed after this effort
   
   1. [api] add `vortex` to `FileFormat` enum
   2. [vortex] implement `VortexFileAccessFactory`
   3. [flink/spark] implement vortex row reader and writer
   4. [flink/spark] add the factory implementation and register to the global 
factory (if keeping the registry model)
   
   Peter thought 2-4 can be consolidated to the `iceberg-vortex` module. But it 
means that `iceberg-vortex` module would depend on `iceberg-flink` and 
`iceberg-spark` modules. 
   
   From my perspective, it is not right to have file format module (like 
`iceberg-vortex`) depends on engine module. The reverse dependency model 
(engine depends on file format module) makes more sense to me. The engine 
integration code for file format (readers for engine data type like Flink 
`RowData`) should exist in the engine module. I know it means a new file format 
support would requires changes in engine modules. But that would be the same 
for engines living outside iceberg repo.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Core: Interface based DataFile reader and writer API [iceberg]

Reply via email to