Hi all,

We are adding support for Paimon inside Polaris's SparkCatalog. Before we add more formats, we would like to get community input on the intended architecture.
This discussion originated from a code review conversation in PR #3820 <https://github.com/apache/polaris/pull/3820#discussion_r2865885791>.

*Current design*

When SparkCatalog.loadTable is called, the routing works in three phases:

1. Try the Iceberg catalog (icebergSparkCatalog.loadTable). If it succeeds, return immediately.
2. Call getTableFormat(ident), which makes a single HTTP GET to the Polaris server to read the provider property stored in the generic table metadata, without triggering any Spark DataSource resolution.
3. Route based on the provider string:
   - "paimon": delegate to Paimon's SparkCatalog
   - unknown/other: fall back to polarisSparkCatalog.loadTable, which performs full DataSource resolution

The same three-phase pattern is repeated independently in loadTable, alterTable, and dropTable (createTable does not currently follow this pattern). This raises the concern that the routing logic is intrusive: every new format requires parallel changes across all three methods, and there is no single place that describes the full routing policy.

*Questions for discussion*

1. Should Polaris determine the provider first (via metadata) and delegate to a single matching catalog, or should it attempt multiple sub-catalogs in a defined order?
2. If multiple sub-catalogs are supported, should there be a documented, deterministic resolution order (Iceberg -> Paimon -> Delta -> Hudi -> Polaris fallback)? Who owns that order, and should it be configurable by operators?
3. Should the per-format routing logic be centralised behind an abstraction (e.g. a SubCatalogRouter interface or a provider registry), so that adding a new format is a single registration rather than edits across loadTable, alterTable, and dropTable?
4. Consistency: should all table operations (loadTable, createTable, alterTable, dropTable, renameTable) follow the same routing strategy, or are per-operation differences acceptable? Currently createTable has a different branching structure from loadTable.
5. Is it in scope for Polaris to act as a routing layer for multiple table providers, or should users who need both Polaris and Paimon configure them as separate catalogs in their Spark session and route at the session level themselves?

We have a working Paimon implementation today and would like to avoid locking in a pattern that becomes hard to extend. Any input on the design direction, or pointers to prior discussion on this topic, would be much appreciated.

Best regards,
I-Ting
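
P.S. To make question 3 a bit more concrete, here is a minimal sketch of what a provider registry could look like. It is only an illustration under stated assumptions: SubCatalog, SubCatalogRouter, register, resolve, and the in-memory provider map are hypothetical names and stand-ins, not existing Polaris, Spark, or Paimon APIs. The sketch covers only phases 2 and 3 of the current design (the Iceberg-first attempt is left out), and plain strings stand in for Spark's Identifier and Table types.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch: none of these types exist in Polaris or Spark today.
final class FormatRoutingSketch {

  // Stand-in for the table operations SparkCatalog delegates to a format-specific catalog.
  interface SubCatalog {
    String loadTable(String ident);
    boolean dropTable(String ident);
  }

  // The one place that holds the routing policy for phases 2 and 3.
  static final class SubCatalogRouter {
    private final Map<String, SubCatalog> byProvider = new HashMap<>();
    private final Function<String, Optional<String>> providerLookup;
    private final SubCatalog fallback;

    SubCatalogRouter(Function<String, Optional<String>> providerLookup, SubCatalog fallback) {
      this.providerLookup = providerLookup;
      this.fallback = fallback;
    }

    // Adding a format is a single registration.
    SubCatalogRouter register(String provider, SubCatalog catalog) {
      byProvider.put(provider.toLowerCase(Locale.ROOT), catalog);
      return this;
    }

    // Phase 2: read the provider. Phase 3: pick the matching catalog or fall back.
    SubCatalog resolve(String ident) {
      return providerLookup.apply(ident)
          .map(p -> byProvider.get(p.toLowerCase(Locale.ROOT)))
          .orElse(fallback);
    }
  }

  // Toy catalog that tags results with its name so the routing decision is visible.
  static SubCatalog named(String name) {
    return new SubCatalog() {
      @Override public String loadTable(String ident) { return name + ":" + ident; }
      @Override public boolean dropTable(String ident) { return true; }
    };
  }

  // In-memory stand-in for the provider property that getTableFormat reads from Polaris.
  static SubCatalogRouter demoRouter() {
    Map<String, String> providers = new HashMap<>();
    providers.put("db.orders", "paimon");
    providers.put("db.events", "delta"); // no catalog registered for this provider
    return new SubCatalogRouter(
            ident -> Optional.ofNullable(providers.get(ident)), named("polaris-fallback"))
        .register("paimon", named("paimon"));
  }

  public static void main(String[] args) {
    SubCatalogRouter router = demoRouter();
    for (String ident : new String[] {"db.orders", "db.events", "db.missing"}) {
      System.out.println(router.resolve(ident).loadTable(ident));
    }
  }
}
```

With something along these lines, loadTable, alterTable, and dropTable would each reduce to resolve(ident) followed by the delegated call, and supporting another format would mean one more register(...) line. Whether createTable can use the same lookup is still open, since the provider comes from the request there rather than from stored metadata.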
