asolimando commented on PR #21815:
URL: https://github.com/apache/datafusion/pull/21815#issuecomment-4354943055

   > Thanks for the thoughtful response @asolimando — the framing is exactly 
right, and the prior discussion with @kosiew in #21483 is helpful context.
   > 
   > **On scope**: agreed, let's land per-call caching in this PR (your Option 
1) and treat cross-call caching with stable node IDs as a follow-up. Could you 
open an issue for Option 2 so we don't lose track?
   > 
   > **On the cache key**: (Arc::as_ptr, partition) is safe within a single 
synchronous compute_statistics walk — the Arcs are held by the plan tree and 
can't be dropped during the call, so pointer reuse isn't a concern. Good call.
   > 
   > **On benchmarks**: I'd avoid full TPC-DS Q99 — statistics computation is a 
small fraction of total query time and will get lost in noise. A targeted 
micro-bench is more informative:
   > 
   > * Build a deeply nested plan (e.g., a 10+ deep UnionExec chain, or a chain 
of hash joins + repartitions) and time compute_statistics(plan, None) 
before/after this PR.
   > * Optionally reuse a reproducer from [[EPIC] Improve query planning speed 
#19795](https://github.com/apache/datafusion/issues/19795) (planning-speed 
EPIC) since deep plans are exactly that issue's pain point.
   > 
   > That should cleanly demonstrate the gain.
   
   Thanks for the confirmation and the clarifications, I will hopefully get to 
it early next week and I will ping you back as soon as I will have some updates!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to