asolimando commented on PR #21500: URL: https://github.com/apache/datafusion/pull/21500#issuecomment-4216014619
Hey @Dandandan, I noticed this PR and it resonates a lot with what I have been working on lately.

Specifically, a combination of two PRs would make similar improvements easy to support, in a way that downstream consumers could fully override: https://github.com/apache/datafusion/pull/21483 (introducing `StatisticRegistry`, which provides customizable support for statistics at the operator level and in physical rules) and https://github.com/apache/datafusion/pull/21122 (introducing `ExpressionAnalyzer`, which adds customizable support for statistics at the expression level, including casts, support that is entirely missing at the moment, as you noticed). Overriding covers custom statistics as well as the way built-in operators deal with statistics, if one needs different statistics propagation or cardinality estimation.

In local testing I saw a similar speedup for Q99 (~93x) and for other queries, for exactly the reasons you described: better statistics led to a better choice of join operand parameters and join sides.

At the moment there is still a limitation due to the fully additive and non-breaking stance we took for introducing the framework, but once https://github.com/apache/datafusion/issues/20184 gets in we will be able to fully support all operators, improve current CBO decisions like the `JoinSelection` you mentioned, and show similar improvements in TPC-* benchmarks. Additionally, this could lead to moving beyond some per-query flags like [prefer_hash_join](https://github.com/apache/datafusion/blob/1f37a33ce530bdedcaf3aba65295703874cd7d09/datafusion/common/src/config.rs#L1073) and using a CBO approach to set the join type at the individual operator level.

Apart from sharing a potential connection to what I am working on (in case you are interested), I also wanted to ask why the PR was closed: are you exploring other angles, or did you hit a roadblock?
