liurenjie1024 commented on issue #1604: URL: https://github.com/apache/iceberg-rust/issues/1604#issuecomment-3205265515
> Here are my considerations: DataFusion partitions statically in the planning phase, so compared with size-based planning, a row-group-based plan may achieve more balanced partitioning when data is skewed.

Please remember that this library is designed for all compute engines, not only DataFusion. Pruning row groups in the planning phase helps a little with data skew, but not much in most cases. Remember that we have further filters, such as page-level filtering, deletion vectors, and equality deletes, that remove more records after planning.

There are many problems with row group pruning in the planning phase:

1. In a distributed compute engine, planning typically happens on the master/driver node, so the cost of reading only manifests is very different from the cost of reading Parquet files. There are far more data files than manifest files.
2. The cost of opening data files can't be neglected, not to mention parsing the footer and running the filter logic. Manifest files are typically cached by the compute engine, while data files are not.
3. Even if an advanced compute engine can do distributed planning, opening each data file twice (once for planning and once for scanning) costs more than necessary.
4. It makes the planning phase format-specific.

Compared with the problems it brings, I don't see much benefit. There are many ways to deal with data skew, and this approach doesn't help much.
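To make the manifest-only alternative concrete, here is a minimal sketch of size-based planning that balances partitions using nothing but the file size each manifest entry already records, so no data file is opened during planning. `FileEntry` and `plan_by_size` are hypothetical names for illustration and are not part of the iceberg-rust API; the greedy largest-first assignment is one possible balancing strategy, not necessarily what any engine uses.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical stand-in for the per-file metadata a manifest entry carries.
/// Only the size is needed here; the Parquet footer is never read.
#[derive(Debug, Clone)]
struct FileEntry {
    path: String,
    size_bytes: u64,
}

/// Assigns files to `partitions` buckets, largest file first, always
/// placing the next file into the bucket with the smallest total size.
fn plan_by_size(mut files: Vec<FileEntry>, partitions: usize) -> Vec<Vec<FileEntry>> {
    assert!(partitions > 0, "need at least one partition");

    // Largest files first, so big files are spread out before small ones fill gaps.
    files.sort_by(|a, b| b.size_bytes.cmp(&a.size_bytes));

    // Min-heap of (current load in bytes, partition index).
    let mut heap: BinaryHeap<Reverse<(u64, usize)>> =
        (0..partitions).map(|i| Reverse((0u64, i))).collect();

    let mut plan: Vec<Vec<FileEntry>> = vec![Vec::new(); partitions];
    for file in files {
        let Reverse((load, idx)) = heap.pop().expect("heap holds one entry per partition");
        heap.push(Reverse((load + file.size_bytes, idx)));
        plan[idx].push(file);
    }
    plan
}
```

Calling `plan_by_size(entries, n)` returns `n` lists of files with roughly equal total bytes. The inputs are available from manifests alone, which keeps planning cheap on a driver node and independent of the data file format.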
