ZENOTME commented on issue #398:
URL: https://github.com/apache/iceberg-rust/issues/398#issuecomment-2165768384

   > Hi, @ZENOTME I think already there exists a `to_arrow` method here:
   > 
   > 
https://github.com/apache/iceberg-rust/blob/15e61f23198c4cc5d320d631e22e2fbc02d167c8/crates/iceberg/src/scan.rs#L294
   
   Yes, but I think the benefit of this PR is mostly for the compute engine 
use case. `to_arrow` converts the scan into an arrow batch stream. A compute 
engine, however, expects to get the file scan tasks from the scan, split these 
tasks, and distribute them to different compute nodes so that they can be read 
in parallel. I think that's one reason we provide 
https://github.com/apache/iceberg-rust/blob/15e61f23198c4cc5d320d631e22e2fbc02d167c8/crates/iceberg/src/scan.rs#L201.
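
   To make the "split and distribute" step concrete, here is a minimal sketch 
of round-robin assignment of tasks to compute nodes. The `FileScanTask` struct 
and the `distribute` function below are hypothetical stand-ins for 
illustration, not the iceberg-rust types, and a real engine would likely 
balance by file size rather than by count:

```rust
/// Hypothetical stand-in for a file scan task (NOT the iceberg-rust type).
#[derive(Debug, Clone, PartialEq)]
struct FileScanTask {
    data_file_path: String,
}

/// Assign tasks to `num_nodes` buckets in round-robin order.
/// Each bucket is the work list for one compute node.
fn distribute(tasks: Vec<FileScanTask>, num_nodes: usize) -> Vec<Vec<FileScanTask>> {
    assert!(num_nodes > 0, "need at least one compute node");
    let mut buckets = vec![Vec::new(); num_nodes];
    for (i, task) in tasks.into_iter().enumerate() {
        buckets[i % num_nodes].push(task);
    }
    buckets
}

fn main() {
    // Pretend the scan planned five data files.
    let tasks: Vec<FileScanTask> = (0..5)
        .map(|i| FileScanTask {
            data_file_path: format!("data/file-{i}.parquet"),
        })
        .collect();

    // Each node would then read its own bucket independently.
    for (node, bucket) in distribute(tasks, 2).iter().enumerate() {
        println!("node {node}: {} task(s)", bucket.len());
    }
}
```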
 
   
   For this use case, the user creates the reader and uses it to convert each 
file scan task into an arrow batch stream, rather than calling `to_arrow` 
directly, like the following:
   ```
   let reader = ArrowReaderBuilder::new(self.file_io.clone(), self.schema.clone())
                            .with_field_id(....)
                            .with_predict(..);
   
   for file_scan in file_scan_stream {
       let arrow_batch_stream = reader.read(file_scan);
   }
   ```
   
   But for now, the reader is not friendly to this use case. Providing the 
`field_id` and `predict` info to the reader is redundant and prone to 
inconsistency, because both were already used to create the scan. A friendlier 
way is to include this necessary info in the scan task, so that the reader is 
"stateless" and cannot disagree with the scan, like the following:
   ```
   let reader = ArrowReaderBuilder::new(self.file_io.clone(), self.schema.clone());
   
   for file_scan in file_scan_stream {
       let arrow_batch_stream = reader.read(file_scan);
   }
   }
   ```
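
   As a rough illustration of the idea (not the actual iceberg-rust API), the 
sketch below lets the scan stamp its projection and filter onto every task, so 
the reader needs no configuration of its own. All of the types here (`Scan`, 
`FileScanTask`, `Predicate`, `Reader`) and the `project_field_ids` and 
`predicate` field names are hypothetical, chosen only for this example:

```rust
/// Hypothetical predicate: keep rows where field `field_id` > `bound`.
#[derive(Debug, Clone, PartialEq)]
struct Predicate {
    field_id: i32,
    bound: i64,
}

/// Hypothetical task that carries everything the reader needs.
#[derive(Debug, Clone)]
struct FileScanTask {
    data_file_path: String,
    project_field_ids: Vec<i32>,
    predicate: Predicate,
}

/// Hypothetical scan: projection and filter are configured once, here.
struct Scan {
    files: Vec<String>,
    project_field_ids: Vec<i32>,
    predicate: Predicate,
}

impl Scan {
    /// Copy the scan's projection and predicate into every planned task.
    fn plan_files(&self) -> Vec<FileScanTask> {
        self.files
            .iter()
            .map(|path| FileScanTask {
                data_file_path: path.clone(),
                project_field_ids: self.project_field_ids.clone(),
                predicate: self.predicate.clone(),
            })
            .collect()
    }
}

/// Stateless reader: it holds no projection or predicate of its own.
struct Reader;

impl Reader {
    /// Describe how the task would be read. A real reader would return
    /// a stream of record batches instead of a string.
    fn read(&self, task: &FileScanTask) -> String {
        format!(
            "read {} fields {:?} where field {} > {}",
            task.data_file_path,
            task.project_field_ids,
            task.predicate.field_id,
            task.predicate.bound
        )
    }
}

fn main() {
    let scan = Scan {
        files: vec!["a.parquet".to_string(), "b.parquet".to_string()],
        project_field_ids: vec![1, 3],
        predicate: Predicate { field_id: 1, bound: 10 },
    };
    let reader = Reader;
    for task in scan.plan_files() {
        println!("{}", reader.read(&task));
    }
}
```

   Because each task is self-describing, it could be serialized and sent to 
another node, and any reader instance there would apply the same projection 
and filter as the original scan.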
   
   

