ZENOTME commented on issue #398: URL: https://github.com/apache/iceberg-rust/issues/398#issuecomment-2165768384
> Hi, @ZENOTME I think already there exists a `to_arrow` method here:
>
> https://github.com/apache/iceberg-rust/blob/15e61f23198c4cc5d320d631e22e2fbc02d167c8/crates/iceberg/src/scan.rs#L294

Yes, but I think the benefit of this PR is more about the compute engine use case. `to_arrow` converts the scan into an arrow batch stream. A compute engine, however, expects to get the file scan tasks from the scan, split these tasks, and distribute them to different compute nodes so that they can be read in parallel. I think that's one reason we provide https://github.com/apache/iceberg-rust/blob/15e61f23198c4cc5d320d631e22e2fbc02d167c8/crates/iceberg/src/scan.rs#L201.

For this use case, the user creates the reader and uses it to convert each file scan task into an arrow batch stream, rather than calling `to_arrow` directly, like the following:

```
let reader = ArrowReaderBuilder::new(self.file_io.clone(), self.schema.clone())
    .with_field_id(....)
    .with_predict(..)
for file_scan in file_scan_stream {
    let arrow_batch_stream = reader.read(file_scan)
}
```

But for now, the reader is not friendly for this use case. Providing the `field_id` and `predict` info to the reader is redundant and prone to inconsistency, because this info has already been used to create the scan. A friendlier way is to carry this info in the scan task itself, so that the reader is just "stateless" and the inconsistency problem goes away, like the following:

```
let reader = ArrowReaderBuilder::new(self.file_io.clone(), self.schema.clone());
for file_scan in file_scan_stream {
    let arrow_batch_stream = reader.read(file_scan)
}
```
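To make the proposal a bit more concrete, here is a rough, self-contained sketch of the idea. It does not use the real iceberg-rust types: `Predicate`, `FileScanTask`, `TableScan`, `StatelessReader` and `split_round_robin` below are all hypothetical stand-ins, and the field names are assumptions chosen only for illustration. The point is just that each task carries the projection and filter, so a reader on any node needs no extra configuration.

```rust
// Hypothetical stand-ins for illustration only; NOT the iceberg-rust API.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Predicate {
    AlwaysTrue,
    GreaterThan { field_id: i32, value: i64 },
}

/// A scan task that carries everything needed to read one file.
#[derive(Debug, Clone, PartialEq)]
struct FileScanTask {
    data_file_path: String,
    project_field_ids: Vec<i32>,
    predicate: Predicate,
}

/// The scan is the single place where projection and filter are configured.
struct TableScan {
    files: Vec<String>,
    project_field_ids: Vec<i32>,
    predicate: Predicate,
}

impl TableScan {
    /// Copy the scan's projection and predicate into every task.
    fn plan_files(&self) -> Vec<FileScanTask> {
        self.files
            .iter()
            .map(|path| FileScanTask {
                data_file_path: path.clone(),
                project_field_ids: self.project_field_ids.clone(),
                predicate: self.predicate.clone(),
            })
            .collect()
    }
}

/// Assign tasks to `nodes` compute nodes in round-robin order.
fn split_round_robin(tasks: Vec<FileScanTask>, nodes: usize) -> Vec<Vec<FileScanTask>> {
    assert!(nodes > 0, "need at least one node");
    let mut parts = vec![Vec::new(); nodes];
    for (i, task) in tasks.into_iter().enumerate() {
        parts[i % nodes].push(task);
    }
    parts
}

/// The reader holds no projection or filter state of its own.
struct StatelessReader;

impl StatelessReader {
    /// Stand-in for producing a batch stream: describe what would be read.
    fn read(&self, task: &FileScanTask) -> String {
        format!(
            "read {} fields={:?} filter={:?}",
            task.data_file_path, task.project_field_ids, task.predicate
        )
    }
}
```

With this shape, a compute node only needs the tasks it was assigned: `StatelessReader.read(&task)` uses the projection and predicate stored in the task, so they cannot drift from what the scan was built with.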