roeap commented on issue #1314:
URL: https://github.com/apache/iceberg-rust/issues/1314#issuecomment-2903969557
Just sharing some experiences from the delta world which may not be immediately
applicable to the question of which trait to use, but may be food for
thought as to where things could be heading.
One thing that repeatedly comes up when talking about table formats is
"Metadata is Data". To me, the logical consequence is to treat it as
such, meaning process it with the same tools that you would use to process
data. To that end, delta-rs currently keeps all metadata around as arrow
record batches, and delta-kernel goes even further, abstracting away the
specific data representation.
As such, the higher-level abstractions we chose are at the level of file
formats, i.e. read this {parquet,json,avro,..} file into arrow with this
schema. The internal logic processing the metadata either visits individual
fields or applies expressions to the data to generate the plans for scans etc.
I think to a certain degree this thinking is actually baked into the Iceberg
spec via the metadata tables.
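To make the "apply expressions on the metadata" idea a bit more concrete, here
is a rough, std-only sketch. Everything in it is made up for illustration:
`MetadataBatch` stands in for an arrow record batch of per-file statistics,
and `Predicate` / `plan_scan` are hypothetical names, not delta-rs,
delta-kernel or iceberg-rust APIs.

```rust
/// Toy columnar batch of file-level metadata: one Vec per column, all the
/// same length. A stand-in for an arrow RecordBatch (hypothetical type).
#[derive(Debug, Clone)]
pub struct MetadataBatch {
    pub path: Vec<String>,
    pub min_value: Vec<i64>,
    pub max_value: Vec<i64>,
}

/// Tiny expression language over the statistics columns (hypothetical).
pub enum Predicate {
    /// Rows with `value > literal` may exist in the file.
    GreaterThan(i64),
    /// Rows with `value < literal` may exist in the file.
    LessThan(i64),
    And(Box<Predicate>, Box<Predicate>),
}

impl Predicate {
    /// Evaluate the expression against the whole batch at once,
    /// producing a keep/skip mask with one entry per file.
    pub fn evaluate(&self, batch: &MetadataBatch) -> Vec<bool> {
        match self {
            Predicate::GreaterThan(v) => batch.max_value.iter().map(|max| max > v).collect(),
            Predicate::LessThan(v) => batch.min_value.iter().map(|min| min < v).collect(),
            Predicate::And(left, right) => {
                let left = left.evaluate(batch);
                let right = right.evaluate(batch);
                left.iter().zip(right.iter()).map(|(a, b)| *a && *b).collect()
            }
        }
    }
}

/// "Plan" a scan: return the paths of the files that survive pruning.
pub fn plan_scan(batch: &MetadataBatch, predicate: &Predicate) -> Vec<String> {
    let mask = predicate.evaluate(batch);
    batch
        .path
        .iter()
        .zip(mask)
        .filter(|(_, keep)| *keep)
        .map(|(path, _)| path.clone())
        .collect()
}

fn main() {
    let batch = MetadataBatch {
        path: vec!["a.parquet".into(), "b.parquet".into(), "c.parquet".into()],
        min_value: vec![0, 11, 21],
        max_value: vec![10, 20, 30],
    };
    // value > 12 AND value < 20
    let predicate = Predicate::And(
        Box::new(Predicate::GreaterThan(12)),
        Box::new(Predicate::LessThan(20)),
    );
    println!("{:?}", plan_scan(&batch, &predicate));
}
```

The point of the sketch is only that file pruning becomes an ordinary
expression evaluation over columnar data, which is the part an engine could
swap for its own batches and expression evaluator.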
By default we provide an arrow (arrays and kernels) and object_store based
implementation using many of the same tools used here to read data. Currently I
am working on a DataFusion engine for kernel, where DataFusion's execution plans
are used to read data and DataFusion's native expressions for evaluation.
As a consequence, virtually all resource management is under full control of
the query engine, which is also free to apply any more advanced optimisations
(caching, etc.) as it sees fit.
All that said, I am about to start a PoC to find out how much of the query
planning, and eventually also maintenance, that is implemented in the
aforementioned DataFusion engine can be applied to both delta and iceberg.
One thing I am fairly certain of is that the work discussed here will be
making my life much easier, and if we end up in a place where we can at least
do something like ...
```rust
impl<T: ObjectStore> FileIo for T {
...
}
```
that would be awesome!
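For illustration only, a std-only sketch of what such a blanket impl could look
like. Both traits below are simplified, synchronous stand-ins that I made up
for this example: the real `ObjectStore` trait in the object_store crate is
async and has a much larger surface, and the `FileIo` methods here are
assumptions, not the API being discussed in this issue.

```rust
use std::collections::HashMap;
use std::io;

/// Simplified, synchronous stand-in for an object-store abstraction
/// (hypothetical; NOT the object_store crate's trait).
pub trait ObjectStore {
    fn get(&self, location: &str) -> io::Result<Vec<u8>>;
    fn put(&mut self, location: &str, bytes: Vec<u8>) -> io::Result<()>;
}

/// Hypothetical file IO trait as the table-format crate might consume it.
pub trait FileIo {
    fn read(&self, path: &str) -> io::Result<Vec<u8>>;
    fn write(&mut self, path: &str, bytes: Vec<u8>) -> io::Result<()>;
    fn exists(&self, path: &str) -> bool;
}

/// The blanket impl: anything that is an ObjectStore is also a FileIo.
impl<T: ObjectStore> FileIo for T {
    fn read(&self, path: &str) -> io::Result<Vec<u8>> {
        self.get(path)
    }

    fn write(&mut self, path: &str, bytes: Vec<u8>) -> io::Result<()> {
        self.put(path, bytes)
    }

    fn exists(&self, path: &str) -> bool {
        // Naive check for the sketch: a real store would issue a HEAD request.
        self.get(path).is_ok()
    }
}

/// Minimal in-memory store used to exercise the blanket impl.
#[derive(Default)]
pub struct InMemoryStore {
    objects: HashMap<String, Vec<u8>>,
}

impl ObjectStore for InMemoryStore {
    fn get(&self, location: &str) -> io::Result<Vec<u8>> {
        self.objects.get(location).cloned().ok_or_else(|| {
            io::Error::new(io::ErrorKind::NotFound, format!("no object at {location}"))
        })
    }

    fn put(&mut self, location: &str, bytes: Vec<u8>) -> io::Result<()> {
        self.objects.insert(location.to_string(), bytes);
        Ok(())
    }
}

fn main() {
    let mut store = InMemoryStore::default();
    // InMemoryStore only implements ObjectStore, yet the FileIo methods work.
    store.write("metadata/v1.json", b"{}".to_vec()).unwrap();
    println!("exists: {}", store.exists("metadata/v1.json"));
    println!("bytes read: {}", store.read("metadata/v1.json").unwrap().len());
}
```

One thing to keep in mind with this shape: because of Rust's orphan rules, a
blanket impl over an unconstrained `T` has to live in the crate that defines
`FileIo`, so that crate would need to depend on whatever defines `ObjectStore`.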
Once we have a consensus here, I am happy to offer my support driving this
forward!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]