Re: [PR] Fix single-threaded bottleneck in parquet file stream processing [iceberg-rust]

via GitHub Sat, 18 Oct 2025 05:36:06 -0700


liurenjie1024 commented on PR #1684:
URL: https://github.com/apache/iceberg-rust/pull/1684#issuecomment-3346233140


   > Is that right? Or do you think it'd be possible to parallelize things on 
the client side of the core crate?
   
   Actually, that's not right. The desired flow is as follows:
   1. (core crate) `TableScan.plan_files` splits the scan into several 
pieces. Each `FileScanTask` contains several parts, and each part is a 
section of a large Parquet data file.
   2. (external engine) The external engine parallelizes the scan by running 
`FileScanTask`s in parallel. For example, in Spark each `FileScanTask` is 
assigned to one task.
   3. (core crate) The `ArrowReader` accepts one `FileScanTask` and reads it 
into an Arrow data stream. This happens in the core crate because 
Iceberg-specific concerns, such as type promotion and matching fields by ID, 
should be handled by Iceberg.
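
To make the division of labor concrete, here is a minimal std-only sketch of the three steps. It does not use the iceberg crate: `FilePart`, `ScanTask`, `plan_tasks`, `read_task`, and `run_engine` are hypothetical stand-ins invented for illustration, the fixed-size splitting is an assumption, and "reading" a task only sums the byte lengths it covers instead of producing Arrow batches.

```rust
use std::thread;

/// Hypothetical stand-in for one byte range of a Parquet data file.
#[derive(Debug, Clone, PartialEq)]
struct FilePart {
    path: String,
    start: u64,
    length: u64,
}

/// Hypothetical stand-in for a scan task (not the real `FileScanTask`).
#[derive(Debug, Clone, PartialEq)]
struct ScanTask {
    parts: Vec<FilePart>,
}

/// Step 1 (core crate): split every file into parts of at most `part_size`
/// bytes, then group consecutive parts into tasks of `parts_per_task` parts.
fn plan_tasks(files: &[(&str, u64)], part_size: u64, parts_per_task: usize) -> Vec<ScanTask> {
    assert!(part_size > 0 && parts_per_task > 0);
    let mut parts = Vec::new();
    for (path, file_len) in files {
        let mut start = 0;
        while start < *file_len {
            let length = part_size.min(*file_len - start);
            parts.push(FilePart { path: path.to_string(), start, length });
            start += length;
        }
    }
    parts
        .chunks(parts_per_task)
        .map(|chunk| ScanTask { parts: chunk.to_vec() })
        .collect()
}

/// Step 3 (core crate): process exactly one task. A real reader would yield
/// Arrow record batches; this stand-in returns the number of bytes covered.
fn read_task(task: &ScanTask) -> u64 {
    task.parts.iter().map(|p| p.length).sum()
}

/// Step 2 (external engine): run each task on its own worker thread and
/// collect the per-task results in planning order.
fn run_engine(tasks: Vec<ScanTask>) -> Vec<u64> {
    let handles: Vec<_> = tasks
        .into_iter()
        .map(|task| thread::spawn(move || read_task(&task)))
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .collect()
}
```

The point of the sketch is that parallelism lives entirely in `run_engine`, the engine side, while planning and per-task reading stay single-task functions on the core side.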


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


