ZENOTME commented on issue #1604: URL: https://github.com/apache/iceberg-rust/issues/1604#issuecomment-3201553648
Thanks for the feedback, @liurenjie1024! Here are my considerations:

DataFusion partitions statically during the plan phase, so compared with size-based planning, row-group-based planning may produce more balanced partitions when the data is skewed.

> but I have concern to do the filter in task planning phase, as opening data files is typically slow

I think the overall execution time remains the same: we're essentially moving the row-group pruning logic from the reader stage to the planning stage. If metadata is retrieved during planning, the reader can reuse it instead of fetching it again (though we need to weigh the memory overhead).

I think this could be an optional behavior that lets users make the tradeoff: either spend more time during planning to achieve more balanced partitioning, or use a size-based plan for faster planning. The best choice will depend on the actual data distribution.

I also agree we don't need to introduce too many optimizations upfront. Still, I'm curious whether this is a realistic need in practice. If so, it may be worth leaving some room for it in the design so that we can extend in this direction later if necessary.
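
To illustrate the skew argument, here is a minimal, self-contained sketch. It does not use any iceberg-rust or DataFusion API: `RowGroup`, `DataFile`, `assign`, `size_based_plan`, `row_group_plan`, and `skewed_example` are hypothetical names, and the greedy "least-loaded partition" assignment is only an assumed stand-in for whatever the planner would actually do.

```rust
/// Hypothetical row group: its size and whether it survives filter pruning.
struct RowGroup {
    bytes: u64,
    matches_filter: bool,
}

/// Hypothetical data file made of row groups.
struct DataFile {
    row_groups: Vec<RowGroup>,
}

/// Greedily assigns items to the least-loaded partition, largest planned
/// weight first. Each item is (planned_weight, actual_work); the planner only
/// sees the planned weight. Returns the actual work per partition.
fn assign(items: &[(u64, u64)], partitions: usize) -> Vec<u64> {
    assert!(partitions > 0, "need at least one partition");
    let mut sorted = items.to_vec();
    sorted.sort_by(|a, b| b.0.cmp(&a.0)); // stable, descending by planned weight
    let mut planned = vec![0u64; partitions];
    let mut actual = vec![0u64; partitions];
    for (p, a) in sorted {
        // Pick the partition with the smallest planned load (first on ties).
        let idx = (0..partitions).min_by_key(|&i| planned[i]).unwrap();
        planned[idx] += p;
        actual[idx] += a;
    }
    actual
}

/// Size-based plan: whole files are balanced by total file size, without
/// knowing how many row groups the reader will prune later.
fn size_based_plan(files: &[DataFile], partitions: usize) -> Vec<u64> {
    let items: Vec<(u64, u64)> = files
        .iter()
        .map(|f| {
            let total: u64 = f.row_groups.iter().map(|g| g.bytes).sum();
            let surviving: u64 = f
                .row_groups
                .iter()
                .filter(|g| g.matches_filter)
                .map(|g| g.bytes)
                .sum();
            (total, surviving)
        })
        .collect();
    assign(&items, partitions)
}

/// Row-group plan: pruning happens during planning, so only surviving row
/// groups are balanced, and planned weight equals actual work.
fn row_group_plan(files: &[DataFile], partitions: usize) -> Vec<u64> {
    let items: Vec<(u64, u64)> = files
        .iter()
        .flat_map(|f| f.row_groups.iter())
        .filter(|g| g.matches_filter)
        .map(|g| (g.bytes, g.bytes))
        .collect();
    assign(&items, partitions)
}

/// Skewed input: four files of 4 x 100 bytes; only file 0 matches the filter.
fn skewed_example() -> Vec<DataFile> {
    (0..4)
        .map(|file_idx| DataFile {
            row_groups: (0..4)
                .map(|_| RowGroup { bytes: 100, matches_filter: file_idx == 0 })
                .collect(),
        })
        .collect()
}
```

With `skewed_example()` and two partitions, `size_based_plan` leaves all 400 surviving bytes on one partition, while `row_group_plan` splits them 200/200. The cost that the sketch hides is that `row_group_plan` needs every file's row-group metadata at planning time, which is exactly the slower-planning side of the tradeoff.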
