Re: [I] EPIC: Support parallel scan in iceberg-datafusion [iceberg-rust]

via GitHub Tue, 19 Aug 2025 10:14:29 -0700


ZENOTME commented on issue #1604:
URL: https://github.com/apache/iceberg-rust/issues/1604#issuecomment-3201553648


   Thanks for the feedback, @liurenjie1024! Here are my thoughts: DataFusion 
partitions statically at plan time, so compared with size-based planning, 
row-group-based planning may achieve more balanced partitions when the data 
is skewed. 
   
   > but I have concern to do the filter in task planning phase, as opening 
data files is typically slow
   
   I think the overall execution time stays the same: we're essentially moving 
the row group pruning logic from the reader stage to the planning stage. If 
metadata is retrieved during planning, it can be reused by the reader to avoid 
fetching it again (though we need to weigh the memory overhead).
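
   To illustrate the reuse idea, here is a minimal sketch, not existing 
iceberg-rust code. `FooterMeta`, `FooterCache`, and `get_or_fetch` are 
hypothetical names, and the `fetch` closure stands in for whatever actually 
reads the Parquet footer. The planner fills the cache, and the reader hits it 
instead of fetching again.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for parsed file metadata (e.g. a Parquet footer).
#[derive(Debug)]
pub struct FooterMeta {
    /// Row count of each row group in the file.
    pub row_group_rows: Vec<u64>,
}

/// Hypothetical cache keyed by data file path, shared by planner and reader.
#[derive(Default)]
pub struct FooterCache {
    entries: HashMap<String, Arc<FooterMeta>>,
    /// Number of times the metadata actually had to be fetched.
    pub fetches: usize,
}

impl FooterCache {
    /// Return cached metadata, calling `fetch` only on a cache miss.
    pub fn get_or_fetch<F>(&mut self, path: &str, fetch: F) -> Arc<FooterMeta>
    where
        F: FnOnce() -> FooterMeta,
    {
        if let Some(meta) = self.entries.get(path) {
            return Arc::clone(meta);
        }
        self.fetches += 1;
        let meta = Arc::new(fetch());
        self.entries.insert(path.to_string(), Arc::clone(&meta));
        meta
    }

    /// Drop an entry once its file has been read, to bound memory use.
    pub fn evict(&mut self, path: &str) {
        self.entries.remove(path);
    }
}
```

   The `evict` hook is one possible way to handle the memory overhead: an 
entry can be dropped as soon as its file has been scanned.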
   
   I think this could be an optional behavior, allowing users to make the 
tradeoff: either spend more time during planning to achieve more balanced 
partitioning, or use a size-based plan for faster planning. The best choice 
will depend on the actual data distribution.
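
   As a rough sketch of what such an option could look like (all names here, 
`PlanMode`, `ScanUnit`, `assign`, `rows_per_partition`, are hypothetical and 
not part of iceberg-rust or DataFusion): both modes use the same greedy 
assignment and differ only in the weight they balance on. Size-based planning 
uses file bytes, while row-group-based planning uses the rows that survive 
pruning, which is only known after reading metadata.

```rust
/// Hypothetical planning mode; not an existing iceberg-rust option.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum PlanMode {
    /// Balance on file size only; needs no metadata reads.
    SizeBased,
    /// Balance on rows left after row group pruning; needs metadata reads.
    RowGroupBased,
}

/// Hypothetical unit of scan work (one data file).
#[derive(Clone, Debug)]
pub struct ScanUnit {
    pub file_bytes: u64,
    /// Rows remaining after row group pruning.
    pub surviving_rows: u64,
}

impl ScanUnit {
    fn weight(&self, mode: PlanMode) -> u64 {
        match mode {
            PlanMode::SizeBased => self.file_bytes,
            PlanMode::RowGroupBased => self.surviving_rows,
        }
    }
}

/// Greedy assignment: take units heaviest first and give each one to the
/// partition with the smallest load so far. Returns unit indices per partition.
pub fn assign(units: &[ScanUnit], partitions: usize, mode: PlanMode) -> Vec<Vec<usize>> {
    assert!(partitions > 0, "need at least one partition");
    let mut order: Vec<usize> = (0..units.len()).collect();
    order.sort_by_key(|&i| std::cmp::Reverse(units[i].weight(mode)));

    let mut buckets: Vec<Vec<usize>> = vec![Vec::new(); partitions];
    let mut loads = vec![0u64; partitions];
    for i in order {
        let target = (0..partitions).min_by_key(|&p| loads[p]).unwrap();
        loads[target] += units[i].weight(mode);
        buckets[target].push(i);
    }
    buckets
}

/// Rows each partition actually has to read, regardless of planning mode.
pub fn rows_per_partition(units: &[ScanUnit], buckets: &[Vec<usize>]) -> Vec<u64> {
    buckets
        .iter()
        .map(|b| b.iter().map(|&i| units[i].surviving_rows).sum())
        .collect()
}
```

   For example, take four equally sized files whose surviving row counts are 
1000, 10, 1000, and 10, split across two partitions. `SizeBased` yields 
partitions that read 2000 and 20 rows, while `RowGroupBased` yields 1010 and 
1010.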
   
   I also agree we don’t need to introduce too many optimizations upfront. 
Still, I’m curious whether this is a realistic need in practice—if so, it may 
be worth making some room for it in the design so that we can extend in this 
direction later if necessary.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


