westonpace commented on PR #12298: URL: https://github.com/apache/iceberg/pull/12298#issuecomment-2790138898
> Are you suggesting that we should use Arrow as an intermediate format?

That is one way, and personally, I'd be a fan, but I wasn't sure it was practical or not in the code base. I was thinking more that there could just be an abstract base class for formats that were Arrow-native. E.g. "here is the reader api and you need to implement these 10 methods" but also "if you speak arrow, extend this, which implements the 10 methods for you as long as you implement these 3 arrow-methods" (I'm just making the 3 and 10 up here).

> What do you think about the overhead (memory/CPU) of the double transformation? Do you have experience with this on the hot path for reading/writing?

Arrow is designed to be zero-copy. Currently, I believe there are no transformations on the read path. The data starts out as an arrow-rs `RecordBatch`. When we cross the JNI boundary we use the [C data interface](https://arrow.apache.org/docs/format/CDataInterface.html) so only the metadata is marshaled to create the `VectorSchemaRoot`. Then, and this is where I'm a little less certain, I believe it is zero-copy to create Spark's `ColumnarBatch` from `VectorSchemaRoot` (`ArrowColumnVector` is a subclass of `ColumnVector` that satisfies the APIs against the underlying vector directly instead of copying it). So at most there should be a few hundred bytes of allocation (for things like lists of children in structs) per batch.

The write path does have one transformation. Spark is not providing us a `ColumnarBatch` on write (and even if it did I doubt it would be backed by `ArrowColumnVector`) and so a copy is required to go from `InternalRow` to `VectorSchemaRoot`. However, the copy from `VectorSchemaRoot` to Rust's `RecordBatch` is zero-copy so there is only one transformation total.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
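
As an illustration of the "extend this if you speak Arrow" idea in the comment above, here is a minimal, dependency-free sketch. Every name in it (`FormatReader`, `ArrowNativeReader`, `ArrowBatch`, `InMemoryArrowReader`) is hypothetical and exists in neither Iceberg nor Arrow, and the method counts are shrunk (3 general methods, 2 Arrow-facing ones) just as the "10" and "3" in the comment are made up. `ArrowBatch` is a plain stand-in for something like `VectorSchemaRoot`, restricted to `long` columns for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for an Arrow batch (not a real Arrow class): one long[] per column.
final class ArrowBatch {
    final List<long[]> columns;

    ArrowBatch(long[]... columns) {
        this.columns = Arrays.asList(columns);
    }

    int rowCount() {
        return columns.isEmpty() ? 0 : columns.get(0).length;
    }
}

// Hypothetical general reader API: the methods every file format would have to implement.
interface FormatReader {
    List<String> schema();
    long rowCount();
    Iterator<long[]> rows();
}

// Hypothetical base class: implements the general API once, in terms of a smaller
// Arrow-facing surface that Arrow-native formats provide.
abstract class ArrowNativeReader implements FormatReader {
    protected abstract List<String> arrowSchema();
    protected abstract Iterator<ArrowBatch> arrowBatches();

    @Override
    public List<String> schema() {
        return arrowSchema();
    }

    @Override
    public long rowCount() {
        long total = 0;
        for (Iterator<ArrowBatch> it = arrowBatches(); it.hasNext(); ) {
            total += it.next().rowCount();
        }
        return total;
    }

    @Override
    public Iterator<long[]> rows() {
        // Assumption for brevity: rows are copied out of each batch. A real adapter
        // would wrap the underlying vectors instead of copying them.
        List<long[]> out = new ArrayList<>();
        for (Iterator<ArrowBatch> it = arrowBatches(); it.hasNext(); ) {
            ArrowBatch batch = it.next();
            for (int r = 0; r < batch.rowCount(); r++) {
                long[] row = new long[batch.columns.size()];
                for (int c = 0; c < row.length; c++) {
                    row[c] = batch.columns.get(c)[r];
                }
                out.add(row);
            }
        }
        return out.iterator();
    }
}

// Toy Arrow-native format: it only supplies the two Arrow-facing methods.
final class InMemoryArrowReader extends ArrowNativeReader {
    @Override
    protected List<String> arrowSchema() {
        return Arrays.asList("id", "value");
    }

    @Override
    protected Iterator<ArrowBatch> arrowBatches() {
        return Arrays.asList(
            new ArrowBatch(new long[] {1, 2}, new long[] {10, 20}),
            new ArrowBatch(new long[] {3}, new long[] {30})
        ).iterator();
    }
}
```

The row copy in `rows()` only keeps the sketch self-contained. The comment's point is that an Arrow-backed adapter can satisfy the general API against the underlying vectors directly, as `ArrowColumnVector` does for Spark's `ColumnVector`.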
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
