pvary commented on PR #12298: URL: https://github.com/apache/iceberg/pull/12298#issuecomment-2788854677
> Overall, +1 from me 🎉
>
> I made a prototype Lance implementation here: [84bf5c5](https://github.com/apache/iceberg/commit/84bf5c53bc5ea19101bb7f21d72f24666c2b3804)

@westonpace: Thanks for the feedback, I really appreciate that you took the time to implement the API for Lance and shared your learnings!

> That being said, I think a really cool addition in the future would be a base implementation that uses Arrow. As long as a reader/writer can produce/consume VectorSchemaRoot and it puts the field ids in the Arrow field schema, then 80% of the glue code will be provided for them. The name mapping, field id handling, constant handling, and spark<->arrow conversion could all be part of the base implementation.

Are you suggesting that we should use Arrow as an intermediate format? So basically Iceberg should implement the transformations between an Arrow `VectorSchemaRoot` and the engine-specific `ObjectModel`s (Generic/Spark/Flink), and the File Formats should implement the transformation between the File Format internal model and the Arrow `VectorSchemaRoot`?

What do you think about the overhead (memory/CPU) of the double transformation? Do you have experience with this on the hot path for reading/writing? I specifically tried to avoid the double transformation to ensure that the performance doesn't suffer.

Thanks,
Peter

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
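
For readers following the thread, the two layerings discussed in the comment can be sketched roughly as below. This is only an illustration under stated assumptions: `ColumnBatch` is a hypothetical stand-in for an Arrow `VectorSchemaRoot`, `EngineRow` is a hypothetical stand-in for an engine-specific object model row, and none of the class or method names come from the Iceberg or Arrow APIs. The sketch does not measure anything; it only shows where the second materialization step would occur.

```java
import java.util.ArrayList;
import java.util.List;

public class IntermediateFormatSketch {

  // Hypothetical stand-in for an Arrow VectorSchemaRoot:
  // a single long column tagged with an Iceberg field id.
  public static final class ColumnBatch {
    public final int fieldId;
    public final long[] values;

    public ColumnBatch(int fieldId, long[] values) {
      this.fieldId = fieldId;
      this.values = values;
    }
  }

  // Hypothetical stand-in for an engine row (Generic/Spark/Flink).
  public static final class EngineRow {
    public final long value;

    public EngineRow(long value) {
      this.value = value;
    }
  }

  // Single transformation: the file format decodes straight into engine rows.
  public static List<EngineRow> readDirect(long[] fileValues) {
    List<EngineRow> rows = new ArrayList<>(fileValues.length);
    for (long v : fileValues) {
      rows.add(new EngineRow(v));
    }
    return rows;
  }

  // Double transformation, step 1: file format -> intermediate batch.
  // In the proposed layering this part would be written per file format.
  public static ColumnBatch toBatch(int fieldId, long[] fileValues) {
    // The copy models materializing the intermediate representation.
    return new ColumnBatch(fieldId, fileValues.clone());
  }

  // Double transformation, step 2: intermediate batch -> engine rows.
  // In the proposed layering this part would live in a shared base implementation.
  public static List<EngineRow> fromBatch(ColumnBatch batch) {
    List<EngineRow> rows = new ArrayList<>(batch.values.length);
    for (long v : batch.values) {
      rows.add(new EngineRow(v));
    }
    return rows;
  }

  public static List<EngineRow> readViaIntermediate(int fieldId, long[] fileValues) {
    return fromBatch(toBatch(fieldId, fileValues));
  }

  public static void main(String[] args) {
    long[] data = {1L, 2L, 3L};
    List<EngineRow> direct = readDirect(data);
    List<EngineRow> bridged = readViaIntermediate(7, data);
    System.out.println(direct.size() + " rows direct, " + bridged.size() + " rows via intermediate");
  }
}
```

Both paths yield the same rows; the difference is that the second path allocates and fills an intermediate batch before the engine rows are built, which is the memory/CPU cost the comment asks about.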