snithish opened a new issue, #2220:
URL: https://github.com/apache/iceberg-rust/issues/2220

### Is your feature request related to a problem or challenge?

As noted in [#1604](https://github.com/apache/iceberg-rust/issues/1604), Iceberg-DataFusion read performance is currently bottlenecked by single-threaded execution. While [size-based planning](https://github.com/apache/iceberg-rust/issues/128) is the proposed long-term solution, a more immediate improvement would be to parallelize over `FileScanTask`s and use `ArrowReaderBuilder` during plan execution.

### Describe the solution you'd like

Pre-calculate the `FileScanTask` streams and partition them across the available DataFusion partitions, updating the `IcebergTableScan` struct and its `ExecutionPlan` trait implementation:

- **Pre-partitioning Scan Tasks:** `IcebergTableScan` accepts grouped tasks, `tasks: Vec<Vec<FileScanTask>>`, rather than computing streams eagerly.
- **Propagating Partition Counts:** The `compute_properties` method returns `Partitioning::UnknownPartitioning(tasks.len())` instead of the hardcoded `1`, so DataFusion sees one partition per task group.
- **Parallel Stream Execution:** The `execute` phase uses `self.tasks.get(partition)` to look up the slice of tasks mapped to the requested DataFusion partition index and builds an `ArrowReaderBuilder` for that slice only.

### Willingness to contribute

I would be willing to contribute to this feature with guidance from the Iceberg Rust community

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
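
The pre-partitioning step proposed in this issue could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual iceberg-rust or DataFusion code: the `FileScanTask` struct here is a simplified stand-in, `group_tasks` and `tasks_for_partition` are hypothetical helper names, and round-robin assignment is just one possible grouping strategy (the issue does not specify one).

```rust
/// Simplified stand-in for iceberg's `FileScanTask` (assumption: only the
/// fields needed for this sketch are modeled).
#[derive(Debug, Clone, PartialEq)]
struct FileScanTask {
    data_file_path: String,
    length: u64,
}

/// Hypothetical helper: distribute tasks round-robin over at most
/// `target_partitions` groups, one group per DataFusion partition.
fn group_tasks(tasks: Vec<FileScanTask>, target_partitions: usize) -> Vec<Vec<FileScanTask>> {
    // Use at least one group, and never more groups than there are tasks,
    // so no partition is left empty when tasks exist.
    let n = target_partitions.max(1).min(tasks.len().max(1));
    let mut groups: Vec<Vec<FileScanTask>> = vec![Vec::new(); n];
    for (i, task) in tasks.into_iter().enumerate() {
        groups[i % n].push(task);
    }
    groups
}

/// Hypothetical helper mirroring the `self.tasks.get(partition)` lookup:
/// returns the tasks for one partition, or an empty slice if out of range.
fn tasks_for_partition(groups: &[Vec<FileScanTask>], partition: usize) -> &[FileScanTask] {
    groups.get(partition).map(Vec::as_slice).unwrap_or(&[])
}

fn main() {
    let tasks: Vec<FileScanTask> = (0..5)
        .map(|i| FileScanTask {
            data_file_path: format!("data/file-{i}.parquet"),
            length: 1024,
        })
        .collect();

    let groups = group_tasks(tasks, 2);

    // `groups.len()` is the value that would be passed to
    // `Partitioning::UnknownPartitioning(..)` in the proposal.
    println!("partition count: {}", groups.len());
    for partition in 0..groups.len() {
        let files: Vec<&str> = tasks_for_partition(&groups, partition)
            .iter()
            .map(|t| t.data_file_path.as_str())
            .collect();
        println!("partition {partition}: {files:?}");
    }
}
```

Round-robin ignores file sizes, so partitions can end up unbalanced when files differ widely in length; a size-aware grouping would be closer to the size-based planning mentioned as the long-term solution.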
