Hi Dmitri,

Thank you for your clear guidance!


I completely agree with the unified namespace tree principle.

To ensure Polaris acts as the single source of truth and to avoid resolution
ambiguity, I will refactor the implementation to follow a lookup-then-dispatch
pattern.

Instead of speculative probing, the SparkCatalog will first resolve the
table entity via Polaris metadata to identify the provider, then
deterministically route the call, or throw a "Table format mismatch" error
if the API mode is incompatible.
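To make the intent concrete, here is a minimal, self-contained sketch of the lookup-then-dispatch idea. The catalog map, the getTableFormat stub, and the identifier naming are illustrative stand-ins, not actual Polaris or Spark APIs; in the real implementation the lookup would be the single HTTP GET against Polaris metadata and the values would be the actual sub-catalogs:

```java
import java.util.Map;
import java.util.function.Function;

public class LookupThenDispatch {
  // Stand-ins for the real sub-catalogs; each "loads" a table by name.
  static final Map<String, Function<String, String>> SUB_CATALOGS = Map.of(
      "iceberg", ident -> "iceberg-table:" + ident,
      "paimon", ident -> "paimon-table:" + ident);

  // Stand-in for the single metadata lookup against Polaris that returns
  // the provider string stored on the table entity.
  static String getTableFormat(String ident) {
    return ident.startsWith("pm_") ? "paimon" : "iceberg";
  }

  static String loadTable(String ident) {
    // 1. Resolve the provider once via Polaris metadata (no probing).
    String provider = getTableFormat(ident);
    // 2. Deterministically route, or fail with a format-mismatch error.
    Function<String, String> catalog = SUB_CATALOGS.get(provider);
    if (catalog == null) {
      throw new IllegalStateException("Table format mismatch: " + provider);
    }
    return catalog.apply(ident);
  }

  public static void main(String[] args) {
    System.out.println(loadTable("pm_orders")); // routed to Paimon
    System.out.println(loadTable("sales"));     // routed to Iceberg
  }
}
```

The point of the sketch is that there is exactly one lookup per name, and the routing decision lives in one place rather than being repeated per method.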


I have another question regarding table registration for non-delegating
formats.

Since Paimon does not support a delegating catalog mode (unlike
Delta/Hudi), it cannot automatically notify Polaris of its changes.

In my PR, I've implemented an explicit dual-registration during createTable
(physical creation in the Paimon warehouse followed by logical registration
in Polaris).

This ensures Paimon tables are visible via SHOW TABLES.


I would like to ask whether the community has better ideas for handling such
standalone formats. (From my perspective, the dual-registration is not an
atomic operation across both systems: there is still a chance that one
service succeeds while the other fails, which would cause inconsistency.
However, it _seems_ this is the only way to achieve it for a non-delegating
format.)
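For reference, the dual-registration flow with a best-effort compensating delete can be sketched as below. The in-memory sets and the failure flag are stand-ins for the real Paimon catalog and Polaris Generic Tables calls; the rollback narrows the inconsistency window but, as noted above, does not make the operation atomic (the compensating delete can itself fail):

```java
import java.util.HashSet;
import java.util.Set;

public class DualRegistration {
  static final Set<String> paimonWarehouse = new HashSet<>();
  static final Set<String> polarisCatalog = new HashSet<>();
  static boolean failPolaris = false; // simulates a Polaris-side failure

  static void createTable(String ident) {
    // 1. Physical creation in the Paimon warehouse.
    paimonWarehouse.add(ident);
    try {
      // 2. Logical registration in Polaris, so SHOW TABLES sees the table.
      if (failPolaris) {
        throw new RuntimeException("Polaris registration failed");
      }
      polarisCatalog.add(ident);
    } catch (RuntimeException e) {
      // Best-effort rollback: drop the physical table so the two systems
      // do not diverge. Only a mitigation, not true atomicity.
      paimonWarehouse.remove(ident);
      throw e;
    }
  }

  public static void main(String[] args) {
    createTable("t1");
    System.out.println("t1 consistent: "
        + (paimonWarehouse.contains("t1") && polarisCatalog.contains("t1")));
    failPolaris = true;
    try {
      createTable("t2");
    } catch (RuntimeException e) {
      System.out.println("t2 rolled back: " + !paimonWarehouse.contains("t2"));
    }
  }
}
```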


The alternative of having Polaris actively scan external warehouses seems to
introduce significant performance overhead.

Is there a more elegant way to ensure catalog visibility without sacrificing
the goal of a single source of truth, or is this explicit registration the
preferred pattern for now?


Best regards,

I-Ting

On Mon, Mar 16, 2026 at 9:42 PM Dmitri Bourlatchkov <[email protected]> wrote:

> Hi I-Ting,
>
> Thanks for starting this discussion. You bring up important points.
>
> From my point of view, the catalog data controlled by Polaris should form a
> unified namespace tree. In other words, each full table name owned by
> Polaris must be unique and resolve to the same table entity regardless of
> the API used by the client.
>
> If a name is accessed via the Iceberg REST Catalog API and happens to point
> to a Paimon table, I think Polaris ought to report an error to the client
> (something like HTTP 422 "Table format mismatch").
>
> If a name is accessed via the Generic Tables API, the response must
> indicate the actual table format.
>
> I do not think the client should make multiple "lookup" calls for the same
> table name. That creates ambiguity in the name resolution logic and could
> lead to different lookup results in different clients.
>
> I believe the client should select the API it wants to use (IRC or Generic
> Tables) at setup time and then rely on that API for all primary lookup
> calls.
>
> WDYT?
>
> Thanks,
> Dmitri.
>
> On Sat, Mar 14, 2026 at 3:34 AM 李宜頲 <[email protected]> wrote:
>
> > Hi all,
> >
> > We are adding support for Paimon inside Polaris's SparkCatalog. Before we
> > add more formats, we would like to get community input on the intended
> > architecture.
> >
> > This discussion originated from a code review conversation in PR #3820
> > <https://github.com/apache/polaris/pull/3820#discussion_r2865885791>
> >
> >
> >
> > *Current design*
> >
> > When SparkCatalog.loadTable is called, the routing works in three phases:
> >
> >
> > 1. Try the Iceberg catalog (icebergSparkCatalog.loadTable). If it
> > succeeds, return immediately.
> >
> > 2. Call getTableFormat(ident), which makes a single HTTP GET to the
> > Polaris server to read the provider property stored in the generic
> > table metadata, without triggering any Spark DataSource resolution.
> >
> > 3. Route based on the provider string:
> >
> >     - "paimon"  : delegate to Paimon's SparkCatalog
> >
> >     - unknown/other : fall back to polarisSparkCatalog.loadTable, which
> > performs full DataSource resolution
> >
> >
> > The same three-phase pattern is repeated independently in loadTable,
> > alterTable, and dropTable (but createTable does not follow this
> > pattern). This might raise the concern that the routing logic is
> > intrusive: every new format requires parallel changes across all three
> > methods, and there is no single place that describes the full routing
> > policy.
> >
> >
> > *Questions for discussion*
> >
> >
> > 1. Should Polaris determine the provider first (via metadata) and
> > delegate to a single matching catalog, or should it attempt multiple
> > sub-catalogs in a defined order?
> >
> > 2. If multiple sub-catalogs are supported, should there be a
> > documented, deterministic resolution order (Iceberg -> Paimon ->
> > Delta -> Hudi -> Polaris fallback)? Who owns that order, and should
> > it be configurable by operators?
> >
> > 3. Should the per-format routing logic be centralised behind an
> > abstraction (e.g. a SubCatalogRouter interface or a provider
> > registry), so that adding a new format is a single registration
> > rather than edits across loadTable, alterTable, and dropTable?
> >
> > 4. Consistency: Should all table operations (loadTable, createTable,
> > alterTable, dropTable, renameTable) follow the same routing strategy,
> > or are per-operation differences acceptable? Currently createTable
> > has a different branching structure from loadTable.
> >
> > 5. Is it in scope for Polaris to act as a routing layer for multiple
> > table providers, or should users who need both Polaris and Paimon
> > configure them as separate catalogs in their Spark session and route
> > at the session level themselves?
> >
> >
> > We have a working Paimon implementation today and would like to avoid
> > locking in a pattern that becomes hard to extend. Any input on the design
> > direction, or pointers to prior discussion on this topic, would be much
> > appreciated.
> >
> >
> > Best regards,
> >
> > I-Ting
> >
>
