Re: [PR] Core: Interface based DataFile reader and writer API [iceberg]

via GitHub Thu, 10 Apr 2025 10:56:26 -0700


westonpace commented on PR #12298:
URL: https://github.com/apache/iceberg/pull/12298#issuecomment-2790138898


   > Are you suggesting that we should use Arrow as an intermediate format?
   
   That is one way, and personally, I'd be a fan, but I wasn't sure whether it 
would be practical in the code base.
   
   I was thinking more that there could just be an abstract base class for 
formats that are Arrow-native.  E.g. "here is the reader API and you need to 
implement these 10 methods" but also "if you speak Arrow, extend this, which 
implements the 10 methods for you as long as you implement these 3 
Arrow methods" (I'm just making up the 3 and 10 here).
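
   A minimal sketch of that base-class idea, using only the Java standard 
library.  Every name here (`FormatReader`, `ArrowNativeReader`, `Batch`, 
`InMemoryArrowReader`) is hypothetical and is not part of Iceberg or Arrow; 
`Batch` is a stand-in for something like `VectorSchemaRoot`, and the method 
counts are as made up as the 3 and 10 above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for an Arrow batch: a single int column.
final class Batch {
    final int[] values;
    Batch(int[] values) { this.values = values; }
}

// Hypothetical "full" reader API that every file format would implement.
interface FormatReader {
    boolean hasNext();
    int next();
    List<Integer> readAll();
    void close();
}

// Hypothetical base class: a format that speaks batches implements only the
// two abstract methods and inherits the rest of the reader API.
abstract class ArrowNativeReader implements FormatReader {
    protected abstract Batch nextBatch();      // returns null when exhausted
    protected abstract void releaseResources();

    private Batch current;
    private int pos;
    private boolean done;

    @Override
    public boolean hasNext() {
        // Skip empty batches until a value is available or input runs out.
        while (!done && (current == null || pos >= current.values.length)) {
            current = nextBatch();
            pos = 0;
            if (current == null) {
                done = true;
            }
        }
        return !done;
    }

    @Override
    public int next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.values[pos++];
    }

    @Override
    public List<Integer> readAll() {
        List<Integer> out = new ArrayList<>();
        while (hasNext()) {
            out.add(next());
        }
        return out;
    }

    @Override
    public void close() {
        releaseResources();
    }
}

// Example batch-native format: only the two batch-level methods are written.
final class InMemoryArrowReader extends ArrowNativeReader {
    private final Iterator<Batch> batches;
    boolean closed = false;

    InMemoryArrowReader(List<Batch> batches) {
        this.batches = batches.iterator();
    }

    @Override
    protected Batch nextBatch() {
        return batches.hasNext() ? batches.next() : null;
    }

    @Override
    protected void releaseResources() {
        closed = true;
    }
}
```

   A format that is not Arrow-native would implement `FormatReader` directly, 
so the base class is an opt-in shortcut rather than a required intermediate 
format.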
   
   > What do you think about the overhead (memory/CPU) of the double 
transformation? Do you have experience with this on the hot path for 
reading/writing?
   
   Arrow is designed to be zero-copy.  Currently, I believe there are no 
transformations on the read path.  The data starts out as an arrow-rs 
`RecordBatch`.  When we cross the JNI boundary we use the [C data 
interface](https://arrow.apache.org/docs/format/CDataInterface.html) so only 
the metadata is marshaled to create the VectorSchemaRoot.  Then, and this is 
where I'm a little less certain, I believe it is zero-copy to create Spark's 
`ColumnarBatch` from `VectorSchemaRoot` (`ArrowColumnVector` is a subclass of 
`ColumnVector` that satisfies the APIs against the underlying vector directly 
instead of copying it). So at most there should be a few hundred bytes of 
allocation (for things like lists of children in structs) per batch.
   
   The write path does have one transformation.  Spark is not providing us a 
`ColumnarBatch` on write (and even if it did I doubt it would be backed by 
`ArrowColumnVector`) and so a copy is required to go from `InternalRow` to 
`VectorSchemaRoot`.  However, the handoff from `VectorSchemaRoot` to Rust's 
`RecordBatch` is zero-copy, so there is only one transformation in total.
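
   To illustrate the difference between the two paths, here is a rough sketch 
in plain Java.  It does not use the real Spark or Arrow classes: `IntColumn`, 
`WrappedColumn` and `RowToColumnCopier` are hypothetical names, an `int[]` 
stands in for an Arrow vector, and a `List<int[]>` stands in for a stream of 
`InternalRow`s.  The wrapper answers the column API against the shared buffer 
(the read-path case), while the copier has to touch every value to build 
columns from rows (the write-path case).

```java
import java.util.List;

// Hypothetical minimal column API, standing in for Spark's ColumnVector.
interface IntColumn {
    int getInt(int rowId);
    int size();
}

// Read path: a view that serves the API from the underlying buffer directly.
// Only the small wrapper object is allocated; the data is not copied.
final class WrappedColumn implements IntColumn {
    private final int[] buffer; // shared with the producer, not cloned

    WrappedColumn(int[] buffer) {
        this.buffer = buffer;
    }

    @Override
    public int getInt(int rowId) {
        return buffer[rowId];
    }

    @Override
    public int size() {
        return buffer.length;
    }
}

// Write path: rows arrive one at a time, so each value is copied into a
// per-column buffer. This is the single transformation on that path.
final class RowToColumnCopier {
    static int[][] toColumns(List<int[]> rows, int numColumns) {
        int[][] columns = new int[numColumns][rows.size()];
        for (int r = 0; r < rows.size(); r++) {
            int[] row = rows.get(r);
            for (int c = 0; c < numColumns; c++) {
                columns[c][r] = row[c];
            }
        }
        return columns;
    }
}
```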
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

