roeap commented on issue #1314: URL: https://github.com/apache/iceberg-rust/issues/1314#issuecomment-2903969557
Just sharing some experiences from the delta world which may not be immediately applicable to the question of which trait to use, but may be food for thought as to where things could be heading.

One thing that repeatedly comes up when talking about table formats is "Metadata is Data". To me, the logical consequence is to treat it as such, meaning to process it with the same tools you would use for processing data. To that end, delta-rs currently keeps all metadata around as arrow record batches, and delta-kernel goes even further by abstracting away the specific data representation. As such, the higher-level abstractions we chose are at the level of file formats, i.e. "read this {parquet,json,avro,..} file into arrow with this schema". The internal logic processing the metadata either visits individual fields or applies expressions to the data to generate the plans for scans etc. I think to a certain degree this thinking is actually baked into the Iceberg spec via the metadata tables.

By default we provide an arrow (arrays and kernels) and object_store based implementation, using many of the same tools used here to read data. Currently I am working on a datafusion engine for kernel, where datafusion's execution plans are used to read data and datafusion's native expressions are used for evaluation. As a consequence, virtually all resource management is under full control of the query engine, which is also free to apply any more advanced optimisations (caching, etc.) as it sees fit.

All that said, I am about to start a PoC to find out how much of the query planning, and eventually also maintenance, that is implemented in the aforementioned datafusion engine can be applied to both delta and iceberg. One thing I am fairly certain of is that the work discussed here will make my life much easier, and if we end up in a place where we can at least do something like ...

```rust
impl<T: ObjectStore> FileIo for T { ... }
```

that would be awesome!
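For illustration only, here is a minimal sketch of how such a blanket impl could behave. It assumes simplified, synchronous stand-ins for both traits: the `ObjectStore` and `FileIo` traits below are hypothetical and are not the real `object_store` or iceberg-rust APIs (the real `ObjectStore` trait is async and much larger). `InMemoryStore` is likewise a made-up type for the demo.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-in for an object store abstraction.
// The real `object_store::ObjectStore` trait is async and has more methods.
trait ObjectStore {
    fn get(&self, path: &str) -> Option<Vec<u8>>;
    fn put(&mut self, path: &str, bytes: Vec<u8>);
}

// Hypothetical, simplified stand-in for a file I/O trait on the table-format side.
trait FileIo {
    fn read(&self, location: &str) -> Result<Vec<u8>, String>;
    fn write(&mut self, location: &str, bytes: &[u8]) -> Result<(), String>;
}

// Blanket impl: every type implementing `ObjectStore` gets `FileIo` for free.
impl<T: ObjectStore> FileIo for T {
    fn read(&self, location: &str) -> Result<Vec<u8>, String> {
        self.get(location)
            .ok_or_else(|| format!("not found: {location}"))
    }

    fn write(&mut self, location: &str, bytes: &[u8]) -> Result<(), String> {
        self.put(location, bytes.to_vec());
        Ok(())
    }
}

// Made-up in-memory store used only to exercise the blanket impl.
#[derive(Default)]
struct InMemoryStore {
    objects: HashMap<String, Vec<u8>>,
}

impl ObjectStore for InMemoryStore {
    fn get(&self, path: &str) -> Option<Vec<u8>> {
        self.objects.get(path).cloned()
    }

    fn put(&mut self, path: &str, bytes: Vec<u8>) {
        self.objects.insert(path.to_string(), bytes);
    }
}

fn main() {
    let mut store = InMemoryStore::default();
    // `write` and `read` come from `FileIo`, which `InMemoryStore` never implements directly.
    store.write("metadata/v1.json", b"{}").unwrap();
    let bytes = store.read("metadata/v1.json").unwrap();
    println!("read {} bytes", bytes.len());
}
```

The point of the pattern is that an engine which already owns an object store instance would not need a separate adapter type. One trade-off to keep in mind is that a blanket impl like this prevents any other implementation of `FileIo` for types that also implement `ObjectStore`, due to Rust's coherence rules.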
Once we have a consensus here, I am happy to offer my support driving this forward!