notfilippo commented on PR #21829:
URL: https://github.com/apache/datafusion/pull/21829#issuecomment-4333484346

   Thanks for the feedback and the pointer to `remote_catalog.rs`. I was sloppy 
with "at scale"; that wording came from some personal notes on the plan for this 
PR. I will write a better ticket once I have figured out whether this is the 
right approach.
   
   The `CatalogProvider` pattern works well when schema resolution is 
independent of the query. The case I have in mind is different: the schema is 
**predicate-dependent**. Our use case is a store where the column set for a 
given table is not fixed ahead of time, but is determined by opening column 
streams that are selected based on the filter predicates in the query (a wide 
schema-on-read log store where opening all streams upfront is prohibitively 
expensive, and the right streams to open depend on what filters the user wrote).
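
   To make "predicate-dependent" concrete, here is a minimal sketch. Everything in it is hypothetical (`EqFilter`, `StreamIndex`, `resolve_columns`, and the idea of an index keyed by `(column, value)`); none of it is DataFusion API or our actual store. It only shows that the column set falls out of the filters rather than being known ahead of time.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical, simplified predicate of the form `column = value`.
pub struct EqFilter {
    pub column: String,
    pub value: String,
}

/// Hypothetical index: which column streams a given predicate selects.
/// A real store would consult remote metadata instead of an in-memory map.
pub struct StreamIndex {
    pub by_predicate: BTreeMap<(String, String), Vec<String>>,
}

impl StreamIndex {
    /// Resolve the table's columns for one query: the union of the streams
    /// selected by its filters, sorted and de-duplicated.
    pub fn resolve_columns(&self, filters: &[EqFilter]) -> Vec<String> {
        let mut columns = BTreeSet::new();
        for f in filters {
            let key = (f.column.clone(), f.value.clone());
            if let Some(streams) = self.by_predicate.get(&key) {
                columns.extend(streams.iter().cloned());
            }
        }
        columns.into_iter().collect()
    }
}
```

   With no filters this sketch resolves to an empty column set, which is the point: there is no query-independent schema to hand to a `CatalogProvider`.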
   
   Pre-resolving before planning means either:
   - Opening everything (expensive, defeats the purpose of predicate pruning), 
or
   - Blocking on a synchronous call that itself needs to peek at the predicate, 
at which point you've re-implemented an analysis rule, just outside the planner 
(which is our current approach).
   
   What I really want is a rule that sees the partially-analyzed plan 
(including predicates), does async I/O to fetch metadata or open the right 
streams, and rewrites the scan node in place. That's the core motivation for 
`AsyncAnalyzerRule`. 
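
   A rough sketch of the shape I mean, using only the standard library. All names here are assumptions made for illustration: `Plan` is a toy stand-in for a logical plan, `AsyncRuleSketch` is not the trait signature from this PR, `open_streams` is a placeholder for real I/O, and `block_on` is a toy driver standing in for a real async runtime.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for a logical plan; not DataFusion's `LogicalPlan`.
#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    /// Scan whose schema has not been resolved yet.
    UnresolvedScan { table: String },
    /// Scan carrying the columns discovered for this query.
    ResolvedScan { table: String, columns: Vec<String> },
    /// `column = value` filter over an input plan.
    Filter { column: String, value: String, input: Box<Plan> },
}

pub type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

/// Hypothetical trait shape; the signature proposed in the PR may differ.
pub trait AsyncRuleSketch {
    fn analyze<'a>(&'a self, plan: Plan) -> BoxFuture<'a, Result<Plan, String>>;
}

pub struct ResolveScanFromFilter;

impl ResolveScanFromFilter {
    /// Placeholder for async I/O (metadata fetch, opening column streams).
    async fn open_streams(&self, column: &str, value: &str) -> Vec<String> {
        vec![column.to_string(), format!("{value}_payload")]
    }
}

impl AsyncRuleSketch for ResolveScanFromFilter {
    fn analyze<'a>(&'a self, plan: Plan) -> BoxFuture<'a, Result<Plan, String>> {
        Box::pin(async move {
            match plan {
                // The rule sees the predicate and the scan together.
                Plan::Filter { column, value, input } => match *input {
                    Plan::UnresolvedScan { table } => {
                        let columns = self.open_streams(&column, &value).await;
                        let scan = Plan::ResolvedScan { table, columns };
                        Ok(Plan::Filter { column, value, input: Box::new(scan) })
                    }
                    other => Ok(Plan::Filter { column, value, input: Box::new(other) }),
                },
                // Anything else passes through unchanged.
                other => Ok(other),
            }
        })
    }
}

struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Toy busy-polling driver so the sketch runs without an async runtime.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
        std::thread::yield_now();
    }
}
```

   The only property that matters here is that `analyze` can await I/O while it holds both the filter and the scan, then return the rewritten plan.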
   
   I agree the existing `CatalogProvider` async example covers many remote 
catalog cases. The gap is specifically the predicate-aware, deferred schema 
resolution case. I should have been a bit clearer about that in the description 
rather than saying "at scale." :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

