asolimando commented on PR #21815: URL: https://github.com/apache/datafusion/pull/21815#issuecomment-4351618242
> @asolimando thanks! I'm sorry that I'm busy with others this week. > > This PR doesn't fully solve the problem it claims to. The stated goal in the PR description and #20184 is to eliminate exponential recomputation. But for any plan containing a `CoalescePartitionsExec`, `SortPreservingMergeExec`, `RepartitionExec`, `HashJoinExec` (CollectLeft/Auto), `CrossJoinExec`, or `NestedLoopJoinExec` — which is most non-trivial plans — the operator restarts a _fresh_ bottom-up walk from inside its own `partition_statistics` IIUC. So the recomputation isn't gone; > > Caching sounds good, how about making caching part of `StatisticsContext` from day one, then we can have some benchmarks to show off the gains which will be easier for the community to accept the PR, wdyt? Thank you for your input @xudong963, no need to apologies, it's understandable! You raise a fair point, we fully avoid the recomputation only for linear plans, but operators that call `compute_statistics(child, None)` internally don't benefit. This is noted in the "What remains for follow-up" section but I agree it might not be enough for the first iteration, and I anyway should have marked "partially closes https://github.com/apache/datafusion/issues/20184". Re. the cache, I identified the need for the `StatisticsRegistry` already, and we discussed with @kosiew in the related PR (#21483, [comment](https://github.com/apache/datafusion/pull/21483/files#r2033980925), branch `asolimando/statistics-planner-with-statscache-v2`). We agreed to defer it to limit scope, but this is the right place to discuss it. One limitation I identified on the `StatsCache` (as I called it there), is around the cache key, which should "identify" an `ExecutionPlan`, which doesn't have any stable id other than its memory pointer ( so the cache key is effectively `(Arc::as_ptr, partition)`), but I am concerned of nodes being disposed (and re-used). Cache lifecycle/scope: 1. single invocation of `compute_statistics` (as described in https://github.com/apache/datafusion/issues/20184): if we agree on this, then the concern is not valid, as the plan tree is "stable" during the lifetime. When e.g. `CoalescePartitionsExec` calls `compute_statistics(child, None)` internally, the cache already has the subtree results, fully eliminating redundant walks. 2. multiple invocations of `compute_statistics` (same rule or cross-rules): here we necessarily need a stable node ID and we can't rely on the pointer, since nodes can be dropped/recreated The scope of #20184 is, in my understanding, 1. (single walk), if you agree with that, I plan to use `(Arc::as_ptr, partition)` as cache key, and introducing node IDs and expanding the cache lifetime IMO be tackled as a followup (I can create issues for that, if the direction is confirmed), as with this solution we should already see computational benefits. Re. benchmarks, do you have a specific workload in mind (e.g., TPC-DS, Q99)? Also, could I be added to the allowlist to trigger benchmark runs so I can iterate without requiring manual re-runs, in case I need multiple iterations? WDYT? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
