liurenjie1024 commented on issue #1604:
URL: https://github.com/apache/iceberg-rust/issues/1604#issuecomment-3205265515

   > Here is my consideration: DataFusion partitions statically at the planning 
phase, so compared with size-based planning, row-group-based planning may 
achieve more balanced partitioning when the data is skewed.
   
   Please remember that this library is designed for all compute engines, not 
only DataFusion. Pruning row groups at the planning phase helps a little with 
data skew, but not much in most cases. Remember that we have more filters, 
such as page-level filtering, deletion vectors, and equality deletes, to 
further remove records. Pruning row groups at the planning phase has several 
problems:
   
   1. The planning phase typically happens on the master/driver node of a 
distributed compute engine, so the cost of reading only manifest files is very 
different from the cost of also reading Parquet files. There are far more data 
files than manifest files.
   2. The cost of opening data files can't be neglected, not to mention 
parsing the footer and running the filter logic. Manifest files are typically 
cached in the compute engine, while data files are not. 
   3. Even if an advanced compute engine could do distributed planning, 
opening each data file twice (once at planning, once at execution) costs much 
more than necessary.
   4. It makes the planning phase format-specific.
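
   To make points 1 to 3 concrete, here is a rough sketch of a toy cost model 
that only counts file opens. Everything in it is an assumption made for 
illustration: `PlanningStrategy`, `TableLayout`, `planning_opens`, and 
`total_data_file_opens` are hypothetical names, not part of the iceberg-rust 
API, and it assumes every data file that survives manifest-level pruning is 
still opened once at execution.

```rust
// Toy cost model that counts file opens only. Every name here is
// hypothetical and is NOT part of the iceberg-rust API.

/// How far the planner goes when pruning.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum PlanningStrategy {
    /// Prune with manifest metadata only (file-level statistics).
    ManifestOnly,
    /// Additionally open each data file and prune row groups from its footer.
    RowGroupPruning,
}

/// Assumed shape of a table scan after manifest-level pruning.
pub struct TableLayout {
    pub manifest_files: u64,
    /// Data files that survive manifest-level pruning.
    pub surviving_data_files: u64,
}

/// Files the planner (often a single driver node) must open.
pub fn planning_opens(t: &TableLayout, s: PlanningStrategy) -> u64 {
    match s {
        PlanningStrategy::ManifestOnly => t.manifest_files,
        PlanningStrategy::RowGroupPruning => t.manifest_files + t.surviving_data_files,
    }
}

/// Data-file opens summed over planning and execution.
/// Simplifying assumption: execution opens every surviving data file once,
/// i.e. row group pruning never eliminates a whole file.
pub fn total_data_file_opens(t: &TableLayout, s: PlanningStrategy) -> u64 {
    let execution_opens = t.surviving_data_files;
    match s {
        PlanningStrategy::ManifestOnly => execution_opens,
        // Each surviving file is opened at planning and again at execution.
        PlanningStrategy::RowGroupPruning => t.surviving_data_files + execution_opens,
    }
}
```

   With made-up numbers of 10 manifest files and 10,000 surviving data files, 
the planner opens 10 files under `ManifestOnly` and 10,010 under 
`RowGroupPruning`, and total data-file opens go from 10,000 to 20,000. The 
model ignores caching, footer parsing, and filter evaluation, which would 
only widen the gap described above.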
   
   Compared with the problems it brings, I don't see much benefit. There are 
many ways to deal with data skew, and this approach doesn't help that much.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.