Hi Dmitri,

Thank you for your clear guidance!
I completely agree with the unified-namespace-tree principle. To ensure Polaris acts as the single source of truth and to avoid resolution ambiguity, I will refactor the implementation to follow a lookup-then-dispatch pattern: instead of speculative probing, the SparkCatalog will first resolve the table entity via Polaris metadata to identify the provider, then deterministically route the call, or throw a "Table format mismatch" error if the API mode is incompatible.

I have another question regarding table registration for non-delegating formats. Since Paimon does not support a delegating catalog mode (unlike Delta/Hudi), it cannot automatically notify Polaris of its changes. In my PR, I've implemented an explicit dual registration during createTable (physical creation in the Paimon warehouse, followed by logical registration in Polaris). This ensures Paimon tables are visible via SHOW TABLES.

I would like to ask whether the community has better ideas for handling such standalone formats. (From my perspective, the dual registration is not an atomic operation across the two systems: there is still a chance that one service succeeds while the other fails, causing inconsistency. However, it _seems_ this is the only way to achieve it for non-delegating formats.) The alternative, having Polaris actively scan external warehouses, seems to introduce significant performance overhead. Is there a more elegant way to ensure catalog visibility without sacrificing the goal of a single source of truth, or is this explicit registration the preferred pattern for now?

Best regards,
I-Ting

Dmitri Bourlatchkov <[email protected]> wrote on Mon, Mar 16, 2026 at 9:42 PM:

> Hi I-Ting,
>
> Thanks for starting this discussion. You bring up important points.
>
> From my point of view, the catalog data controlled by Polaris should form a
> unified namespace tree.
> In other words, each full table name owned by Polaris must be unique and
> resolve to the same table entity regardless of the API used by the client.
>
> If a name is accessed via the Iceberg REST Catalog API and happens to point
> to a Paimon table, I think Polaris ought to report an error to the client
> (something like HTTP 422 "Table format mismatch").
>
> If a name is accessed via the Generic Tables API, the response must
> indicate the actual table format.
>
> I do not think the client should make multiple "lookup" calls for the same
> table name. That creates ambiguity in the name resolution logic and could
> lead to different lookup results in different clients.
>
> I believe the client should select the API it wants to use (IRC or Generic
> Tables) at setup time and then rely on that API for all primary lookup
> calls.
>
> WDYT?
>
> Thanks,
> Dmitri.
>
> On Sat, Mar 14, 2026 at 3:34 AM 李宜頲 <[email protected]> wrote:
>
> > Hi all,
> >
> > We are adding support for Paimon inside Polaris's SparkCatalog. Before we
> > add more formats, we would like to get community input on the intended
> > architecture.
> >
> > This discussion originated from a code review conversation in PR #3820
> > <https://github.com/apache/polaris/pull/3820#discussion_r2865885791>
> >
> > *Current design*
> >
> > When SparkCatalog.loadTable is called, the routing works in three phases:
> >
> > 1. Try the Iceberg catalog (icebergSparkCatalog.loadTable). If it
> > succeeds, return immediately.
> >
> > 2. Call getTableFormat(ident), which makes a single HTTP GET to the
> > Polaris server to read the provider property stored in the generic table
> > metadata, without triggering any Spark DataSource resolution.
> >
> > 3. Route based on the provider string:
> >
> >    - "paimon": delegate to Paimon's SparkCatalog
> >    - unknown/other: fall back to polarisSparkCatalog.loadTable, which
> >      performs full DataSource resolution
> >
> > The same three-phase pattern is repeated independently in loadTable,
> > alterTable, and dropTable *(but createTable does not follow this
> > pattern)*. This raises the concern that the routing logic is intrusive:
> > every new format requires parallel changes across all three methods, and
> > there is no single place that describes the full routing policy.
> >
> > *Questions for discussion*
> >
> > 1. Should Polaris determine the provider first (via metadata) and
> > delegate to a single matching catalog, or should it attempt multiple
> > sub-catalogs in a defined order?
> >
> > 2. If multiple sub-catalogs are supported, should there be a documented,
> > deterministic resolution order (Iceberg -> Paimon -> Delta -> Hudi ->
> > Polaris fallback)? Who owns that order, and should it be configurable by
> > operators?
> >
> > 3. Should the per-format routing logic be centralised behind an
> > abstraction (e.g. a SubCatalogRouter interface or a provider registry),
> > so that adding a new format is a single registration rather than edits
> > across loadTable, alterTable, and dropTable?
> >
> > 4. Consistency: Should all table operations (loadTable, createTable,
> > alterTable, dropTable, renameTable) follow the same routing strategy, or
> > are per-operation differences acceptable? Currently createTable has a
> > different branching structure from loadTable.
> >
> > 5. Is it in scope for Polaris to act as a routing layer for multiple
> > table providers, or should users who need both Polaris and Paimon
> > configure them as separate catalogs in their Spark session and route at
> > the session level themselves?
> >
> > We have a working Paimon implementation today and would like to avoid
> > locking in a pattern that becomes hard to extend. Any input on the
> > design direction, or pointers to prior discussion on this topic, would
> > be much appreciated.
> >
> > Best regards,
> >
> > I-Ting
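To make the proposed lookup-then-dispatch refactor concrete, here is a minimal, self-contained Java sketch. All names (TableFormatClient, SubCatalog, TableFormatMismatchException) are illustrative stand-ins rather than actual Polaris or Spark classes, and the real SparkCatalog would return Spark Table objects, not strings:

```java
import java.util.Map;
import java.util.Optional;

public class LookupThenDispatch {

    /** Hypothetical stand-in for the single HTTP GET that reads the provider property. */
    interface TableFormatClient {
        Optional<String> getTableFormat(String ident);
    }

    /** Hypothetical stand-in for a per-format Spark sub-catalog. */
    interface SubCatalog {
        String loadTable(String ident);
    }

    /** Error when a name cannot be resolved for the requested API mode. */
    static class TableFormatMismatchException extends RuntimeException {
        TableFormatMismatchException(String message) { super(message); }
    }

    private final TableFormatClient polaris;
    private final Map<String, SubCatalog> routes; // provider -> sub-catalog
    private final SubCatalog fallback;            // full DataSource resolution

    public LookupThenDispatch(TableFormatClient polaris,
                              Map<String, SubCatalog> routes,
                              SubCatalog fallback) {
        this.polaris = polaris;
        this.routes = routes;
        this.fallback = fallback;
    }

    /** Resolve the provider once via Polaris metadata, then route deterministically. */
    public String loadTable(String ident) {
        String provider = polaris.getTableFormat(ident)
            .orElseThrow(() -> new TableFormatMismatchException(
                "No Polaris entity for " + ident));
        SubCatalog target = routes.get(provider);
        // Unknown providers fall back to full DataSource resolution.
        return (target != null ? target : fallback).loadTable(ident);
    }

    public static void main(String[] args) {
        LookupThenDispatch router = new LookupThenDispatch(
            ident -> Optional.of("paimon"),
            Map.of("paimon", ident -> "paimon:" + ident),
            ident -> "generic:" + ident);
        System.out.println(router.loadTable("ns.t1")); // paimon:ns.t1
    }
}
```

Because the routing table is a single registry, adding a format would be one `routes` entry instead of parallel edits across loadTable, alterTable, and dropTable.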

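For the dual-registration path in createTable, a best-effort compensation sketch (again with hypothetical interfaces standing in for the Paimon warehouse and the Polaris registration call) might look like the following; it illustrates, rather than solves, the non-atomicity concern:

```java
import java.util.ArrayList;
import java.util.List;

public class DualRegistration {

    /** Hypothetical stand-in for physical table operations in the Paimon warehouse. */
    interface PaimonWarehouse {
        void createPhysical(String ident);
        void dropPhysical(String ident);
    }

    /** Hypothetical stand-in for the Polaris generic-table registration call. */
    interface PolarisCatalog {
        void registerGeneric(String ident, String provider);
    }

    /**
     * Dual registration with best-effort compensation: physical creation in
     * Paimon first, then logical registration in Polaris. The two steps are
     * not atomic; if registration fails we try to undo the physical creation,
     * and if that drop also fails the systems are left inconsistent -- the
     * exact gap raised in this thread.
     */
    public static void createTable(PaimonWarehouse paimon, PolarisCatalog polaris, String ident) {
        paimon.createPhysical(ident);
        try {
            polaris.registerGeneric(ident, "paimon");
        } catch (RuntimeException e) {
            try {
                paimon.dropPhysical(ident);
            } catch (RuntimeException cleanupFailure) {
                // Inconsistency survives here and must be reconciled out of band.
                e.addSuppressed(cleanupFailure);
            }
            throw e;
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        PaimonWarehouse paimon = new PaimonWarehouse() {
            public void createPhysical(String ident) { log.add("create:" + ident); }
            public void dropPhysical(String ident) { log.add("drop:" + ident); }
        };
        try {
            createTable(paimon, (ident, provider) -> {
                throw new RuntimeException("Polaris unavailable");
            }, "ns.t1");
        } catch (RuntimeException expected) {
            // Registration failed, so the physical table was compensated away.
        }
        System.out.println(log); // [create:ns.t1, drop:ns.t1]
    }
}
```

Compensation narrows the inconsistency window but cannot close it, which is why I am asking whether the community prefers this pattern or something stronger.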