Re: [PR] feat(datafusion): Expose DataFusion statistics on an IcebergTableScan [iceberg-rust]

via GitHub Sat, 08 Feb 2025 08:22:06 -0800


Xuanwo commented on PR #880:
URL: https://github.com/apache/iceberg-rust/pull/880#issuecomment-2645818431


   > It's our guess that this distinction might arise due to scanning 
primitives used. JanKaul/iceberg-rust leverages ParquetExec from DataFusion, 
which is at this point highly optimized, and probably benefits from a more 
favorable work distribution (e.g. a combination of more evenly spread record 
batches across different partition streams, scanning multiple ranges from same 
Parquet files in parallel Tokio tasks, more efficient pruning etc.) than 
get_batch_stream.
   
   Hi, I believe that's highly possible. The existing get_batch_stream is 
designed for simple workloads, and I'm guessing query engines need to build 
their own part distribution logic instead.
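
   As a rough illustration of what engine-side distribution logic could look 
like, here is a minimal sketch that spreads scan tasks across a fixed number of 
partitions by byte size. It is only an assumption about one possible approach: 
ScanTask and distribute are hypothetical names, not the iceberg-rust or 
DataFusion API, and a real engine would also consider row groups, pruning and 
parallel range reads.

```rust
// Hypothetical sketch: these types and functions are illustrative only and
// do not exist in iceberg-rust or DataFusion.
#[derive(Debug, Clone, PartialEq)]
struct ScanTask {
    path: String, // data file to read (assumed field)
    length: u64,  // number of bytes this task scans (assumed field)
}

/// Greedy "largest first" assignment: sort tasks by size descending, then
/// give each task to the partition with the fewest bytes assigned so far.
fn distribute(mut tasks: Vec<ScanTask>, partitions: usize) -> Vec<Vec<ScanTask>> {
    assert!(partitions > 0, "need at least one partition");
    tasks.sort_by(|a, b| b.length.cmp(&a.length));

    let mut groups: Vec<Vec<ScanTask>> = vec![Vec::new(); partitions];
    let mut loads = vec![0u64; partitions];

    for task in tasks {
        // Pick the least-loaded partition; ties go to the lowest index.
        let (idx, _) = loads
            .iter()
            .enumerate()
            .min_by_key(|(_, load)| **load)
            .expect("partitions > 0");
        loads[idx] += task.length;
        groups[idx].push(task);
    }
    groups
}
```

   Each resulting group could then back one partition stream, so that no single 
stream ends up with most of the bytes.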


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


