gruuya commented on PR #880:
URL: https://github.com/apache/iceberg-rust/pull/880#issuecomment-2645025164

   > Hi, thank you @gruuya for working on this. Most changes look good to me. 
Waiting for @liurenjie1024 to take another look.
   
   Thank you for taking a look!
   
   > I also think some actual benchmarking is in order.
   
   FWIW, we also decided to run some baseline benchmarking, and the results 
might suggest that there are other things worth investigating when it comes to 
speeding up scanning (other than just exposing statistics):
   
![image](https://github.com/user-attachments/assets/0f084354-0773-4e27-8ca9-c073964ee242)
   The above was measured against a local MinIO without the changes in this PR. 
However, since query 1 is also fairly slow and doesn't involve joins, the 
problem doesn't seem to be addressable by this PR alone (i.e. it is a systemic 
issue).
   
   We haven't done a deep dive on it, but the comparison with the other 
iceberg-rust implementation is illuminating. 
   
   Our guess is that this difference arises from the scanning primitives 
used. 
[JanKaul/iceberg-rust](https://github.com/JanKaul/iceberg-rust/blob/main/datafusion_iceberg/src/table.rs#L714C22-L715)
 leverages `ParquetExec` from DataFusion, which is at this point highly 
optimized, and probably benefits from a more favorable work distribution (e.g. 
a combination of more evenly spread record batches across different partition 
streams, scanning multiple ranges from the same Parquet files in parallel Tokio 
tasks, more efficient pruning, etc.) than `get_batch_stream`. 
   
   The less favorable work distribution ultimately leads to DataFusion 
inserting an expensive `RepartitionExec` node on top of the 
`IcebergTableScan`. Hence the missing statistics might only be a second-order 
effect.
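
To make the work-distribution point concrete, here is a minimal, std-only sketch of splitting files into byte ranges so that each partition stream gets roughly the same number of bytes. It is not taken from DataFusion or iceberg-rust: `FileRange` and `split_into_partitions` are hypothetical names used only for illustration, and a real Parquet reader would also have to respect row-group boundaries rather than cutting at arbitrary offsets.

```rust
/// A byte range within one data file (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq)]
struct FileRange {
    path: String,
    start: u64,
    end: u64, // exclusive
}

/// Divide the total bytes of `files` (path, size) as evenly as possible
/// across `target_partitions` groups, splitting large files into ranges.
fn split_into_partitions(
    files: &[(String, u64)],
    target_partitions: usize,
) -> Vec<Vec<FileRange>> {
    let mut partitions: Vec<Vec<FileRange>> = vec![Vec::new(); target_partitions];
    let total: u64 = files.iter().map(|(_, size)| *size).sum();
    if total == 0 || target_partitions == 0 {
        return partitions;
    }

    // Byte budget per partition, rounded up so everything fits.
    let n = target_partitions as u64;
    let per_partition = (total + n - 1) / n;

    let mut idx = 0;
    let mut remaining = per_partition; // budget left in partition `idx`
    for (path, size) in files {
        let mut offset = 0;
        while offset < *size {
            // Take as much of this file as the current partition can hold.
            let take = remaining.min(*size - offset);
            partitions[idx].push(FileRange {
                path: path.clone(),
                start: offset,
                end: offset + take,
            });
            offset += take;
            remaining -= take;
            // Move on to the next partition once this one is full.
            if remaining == 0 && idx + 1 < target_partitions {
                idx += 1;
                remaining = per_partition;
            }
        }
    }
    partitions
}
```

With one 100-byte file and one 20-byte file split across 4 partitions, every partition ends up with 30 bytes, whereas a one-file-per-partition scheme would leave two partitions empty and one doing most of the work.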


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

