notfilippo commented on PR #21829: URL: https://github.com/apache/datafusion/pull/21829#issuecomment-4333484346
Thanks for the feedback and the pointer to `remote_catalog.rs`. I was sloppy with "at scale": that wording came from some personal notes on the plan for this PR. I will work on a better ticket after figuring out whether this is the right approach.

The `CatalogProvider` pattern works well when schema resolution is independent of the query. The case I have in mind is different: the schema is **predicate-dependent**. Our use case is a store where the column set for a given table is not fixed ahead of time, but is determined by opening column streams that are selected based on the filter predicates in the query (a wide schema-on-read log store where opening all streams upfront is prohibitively expensive, and the right streams to open depend on what filters the user wrote).

Pre-resolving before planning means either:

- Opening everything (expensive, defeats the purpose of predicate pruning), or
- Blocking on a synchronous call that itself needs to peek at the predicate, at which point you've re-implemented an analysis rule, just outside the planner (which is our current approach)

What I would really want is a rule that sees the partially-analyzed plan (including predicates), does async I/O to fetch metadata or open the right streams, and rewrites the scan node in place. That's the core motivation for `AsyncAnalyzerRule`.

I agree the existing `CatalogProvider` async example covers many remote catalog cases. The gap is specifically the predicate-aware, deferred schema resolution case. I should have been a bit clearer about that in the description rather than saying "at scale." :)
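
To make the shape of such a rule concrete, here is a minimal, std-only sketch. It does not use DataFusion's real types or the actual `AsyncAnalyzerRule` signature from the PR (which is not shown in this comment): `Plan`, `StreamStore`, `ResolveStreams`, and `block_on` are all hypothetical stand-ins invented for illustration. It only shows the idea of reading the predicate first, awaiting a metadata call, and then rewriting the scan.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Toy logical plan (hypothetical stand-in for a real `LogicalPlan`).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// Scan whose column set has not been resolved yet.
    UnresolvedScan { table: String },
    /// Scan after the rule has chosen which column streams to open.
    ResolvedScan { table: String, columns: Vec<String> },
    /// Filter that references some column names in its predicate.
    Filter { predicate_columns: Vec<String>, input: Box<Plan> },
}

/// Hypothetical metadata client; a real one would do network or disk I/O.
struct StreamStore {
    available: Vec<String>,
}

impl StreamStore {
    /// Pretend to open only the streams the predicate asks for.
    async fn open_streams(&self, _table: &str, wanted: &[String]) -> Vec<String> {
        self.available
            .iter()
            .filter(|s| wanted.contains(*s))
            .cloned()
            .collect()
    }
}

/// Hypothetical async rule: resolves a scan using the predicate above it.
struct ResolveStreams {
    store: StreamStore,
}

impl ResolveStreams {
    /// Handles only `Filter` directly over `UnresolvedScan`; a real rule
    /// would walk the whole plan tree.
    async fn analyze(&self, plan: Plan) -> Plan {
        match plan {
            Plan::Filter { predicate_columns, input } => match *input {
                Plan::UnresolvedScan { table } => {
                    // Async I/O happens here, driven by the predicate.
                    let columns = self.store.open_streams(&table, &predicate_columns).await;
                    Plan::Filter {
                        predicate_columns,
                        input: Box::new(Plan::ResolvedScan { table, columns }),
                    }
                }
                other => Plan::Filter { predicate_columns, input: Box::new(other) },
            },
            other => other,
        }
    }
}

/// Waker that does nothing; enough for futures that never really suspend.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling executor for this sketch only; real code would
/// use an async runtime instead.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}
```

Driving it with `block_on(rule.analyze(plan))` on a `Filter` over an `UnresolvedScan` yields a `ResolvedScan` whose columns are just the streams named in the predicate, which is the "open only what the filters need" behavior described above.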
