xudong963 commented on PR #21815:
URL: https://github.com/apache/datafusion/pull/21815#issuecomment-4352740602

   Thanks for the thoughtful response @asolimando — the framing is exactly 
right, and the prior discussion with @kosiew in #21483 is helpful context.
   
   **On scope**: agreed, let's land per-call caching in this PR (your Option 1) 
and treat cross-call caching with stable node IDs as a follow-up. Could you 
open an issue for Option 2 so we don't lose track?
   
   **On the cache key**: (Arc::as_ptr, partition) is safe within a single 
synchronous compute_statistics walk — the Arcs are held by the plan tree and 
can't be dropped during the call, so pointer reuse isn't a concern. Good call.
   
   
   **On benchmarks**: I'd avoid full TPC-DS Q99 — statistics computation is a 
small fraction of total query time and will get lost in noise. A targeted 
micro-bench is more informative:
   
     - Build a deeply nested plan (e.g., a 10+ deep UnionExec chain, or a chain 
of hash joins + repartitions) and time compute_statistics(plan, None) 
before/after this PR.
     - Optionally reuse a reproducer from #19795 (planning-speed EPIC) since 
deep plans are exactly that issue's pain point.
   
   That should cleanly demonstrate the gain.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to