gruuya commented on code in PR #880:
URL: https://github.com/apache/iceberg-rust/pull/880#discussion_r1908312439
##########
crates/integrations/datafusion/src/table/mod.rs:
##########
@@ -41,16 +42,21 @@ pub struct IcebergTableProvider {
     table: Table,
     /// Table snapshot id that will be queried via this provider.
     snapshot_id: Option<i64>,
+    /// Statistics for the table; row count and null count/min-max values per column.
+    /// If not present defaults to `None`.
+    statistics: Option<Statistics>,
     /// A reference-counted arrow `Schema`.
     schema: ArrowSchemaRef,
 }
 
 impl IcebergTableProvider {
-    pub(crate) fn new(table: Table, schema: ArrowSchemaRef) -> Self {
+    pub(crate) async fn new(table: Table, schema: ArrowSchemaRef) -> Self {
+        let statistics = compute_statistics(&table, None).await.ok();

Review Comment:
   I've made the statistics computation opt-in along the above lines now.

   > I think you are referring to join reordering algorithm in query optimizer?

   Yes, that is correct.

   > From my experience, complex table statistics doesn't help much in join reordering.

   I think there are cases where it can help significantly, see https://github.com/apache/datafusion/issues/7949 and https://github.com/apache/datafusion/issues/7950 for instance.

   > For example, if the joined table has many filters, how would you estimate correct statistics after filtering. Histogram may help for single column filter, but not for complex filters. Also cardinality estimation in join doesn't work well.

   Yeah admittedly, the entire procedure is based on a number of heuristics, and can be quite guesstimatative in nature. Still I think there's considerable value to be extracted, even if only hints are provided; cc @alamb who knows a lot more about potential pitfalls and upsides here than me.
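
For readers following the thread, here is a minimal sketch of what "opt-in" statistics could look like. It is not the code from this PR: `DataFileMeta`, `TableStats`, `Provider`, and `with_computed_statistics` are hypothetical stand-ins for illustration, and the sketch is synchronous and tracks only a row count, whereas the real `compute_statistics` in the diff is async and also covers per-column values.

```rust
/// Hypothetical stand-in for a data file's metadata; not the iceberg-rust type.
#[derive(Debug, Clone)]
pub struct DataFileMeta {
    pub record_count: u64,
}

/// Hypothetical stand-in for table-level statistics (row count only).
#[derive(Debug, Clone, PartialEq)]
pub struct TableStats {
    pub num_rows: u64,
}

/// Simplified provider: statistics stay `None` unless the caller opts in.
#[derive(Debug)]
pub struct Provider {
    files: Vec<DataFileMeta>,
    statistics: Option<TableStats>,
}

impl Provider {
    /// Cheap constructor: no metadata scan happens here.
    pub fn new(files: Vec<DataFileMeta>) -> Self {
        Self { files, statistics: None }
    }

    /// Opt-in step: walk the file metadata once and cache the total row count.
    pub fn with_computed_statistics(mut self) -> Self {
        let num_rows: u64 = self.files.iter().map(|f| f.record_count).sum();
        self.statistics = Some(TableStats { num_rows });
        self
    }

    /// What a planner would see: `None` means "unknown", not "empty table".
    pub fn statistics(&self) -> Option<&TableStats> {
        self.statistics.as_ref()
    }
}
```

With this shape, `Provider::new(files)` stays cheap and reports no statistics, while `Provider::new(files).with_computed_statistics()` pays for the metadata walk once, up front, only for callers that want the optimizer hints.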