Hi I-Ting,

Unfortunately, I do not have an answer to your double registration question
off the top of my head, but I added an item for this discussion to the
Community Sync [1] agenda for March 19.
[1] https://polaris.apache.org/community/meetings/

Cheers,
Dmitri.

On Tue, Mar 17, 2026 at 10:19 AM ITing Lee <[email protected]> wrote:

> Hi Dmitri,
>
> Thank you for your clear guidance!
>
> I completely agree with the unified namespace tree principle.
> To ensure Polaris acts as the single source of truth and avoids resolution
> ambiguity, I will refactor the implementation to follow a lookup-then-dispatch
> pattern.
> Instead of speculative probing, the SparkCatalog will first resolve the
> table entity via Polaris metadata to identify the provider, then
> deterministically route the call, or throw a "Table format mismatch" error
> if the API mode is incompatible.
>
> I have another question regarding table registration for non-delegating
> formats.
> Since Paimon does not support a delegating catalog mode (unlike
> Delta/Hudi), it cannot automatically notify Polaris of its changes.
> In my PR, I've implemented an explicit dual registration during createTable
> (physical creation in the Paimon warehouse followed by logical registration
> in Polaris).
> This ensures Paimon tables are visible via SHOW TABLES.
>
> I would like to ask whether the community has better ideas for handling
> such standalone formats. (From my perspective, the dual registration is not
> an atomic operation across both systems. There is still a chance that one
> of the services succeeds while the other fails, causing inconsistency.
> However, it _seems_ this is the only way to achieve it for non-delegating
> formats.)
>
> The alternative of having Polaris actively scan external warehouses seems
> to introduce significant performance overhead.
> Is there a more elegant way to ensure catalog visibility without
> sacrificing the goal of a single source of truth, or is this explicit
> registration the preferred pattern for now?
>
> Best regards,
> I-Ting
>
> Dmitri Bourlatchkov <[email protected]> wrote on Mon, Mar 16, 2026 at 9:42 PM:
>
> > Hi I-Ting,
> >
> > Thanks for starting this discussion. You bring up important points.
> >
> > From my point of view, the catalog data controlled by Polaris should form
> > a unified namespace tree. In other words, each full table name owned by
> > Polaris must be unique and resolve to the same table entity regardless of
> > the API used by the client.
> >
> > If a name is accessed via the Iceberg REST Catalog API and happens to
> > point to a Paimon table, I think Polaris ought to report an error to the
> > client (something like HTTP 422 "Table format mismatch").
> >
> > If a name is accessed via the Generic Tables API, the response must
> > indicate the actual table format.
> >
> > I do not think the client should make multiple "lookup" calls for the
> > same table name. That creates ambiguity in the name resolution logic and
> > could lead to different lookup results in different clients.
> >
> > I believe the client should select the API it wants to use (IRC or
> > Generic Tables) at setup time and then rely on that API for all primary
> > lookup calls.
> >
> > WDYT?
> >
> > Thanks,
> > Dmitri.
> >
> > On Sat, Mar 14, 2026 at 3:34 AM 李宜頲 <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > We are adding support for Paimon inside Polaris's SparkCatalog. Before
> > > we add more formats, we would like to get community input on the
> > > intended architecture.
> > >
> > > This discussion originated from a code review conversation in PR #3820
> > > <https://github.com/apache/polaris/pull/3820#discussion_r2865885791>
> > >
> > > *Current design*
> > >
> > > When SparkCatalog.loadTable is called, the routing works in three
> > > phases:
> > >
> > > 1. Try the Iceberg catalog (icebergSparkCatalog.loadTable). If it
> > > succeeds, return immediately.
> > >
> > > 2. Call getTableFormat(ident), which makes a single HTTP GET to the
> > > Polaris server to read the provider property stored in the generic
> > > table metadata, without triggering any Spark DataSource resolution.
> > >
> > > 3. Route based on the provider string:
> > >
> > > - "paimon": delegate to Paimon's SparkCatalog
> > >
> > > - unknown/other: fall back to polarisSparkCatalog.loadTable, which
> > > performs full DataSource resolution
> > >
> > > The same three-phase pattern is repeated independently in loadTable,
> > > alterTable, and dropTable *(but createTable does not follow this
> > > pattern)*. This raises the concern that the routing logic is intrusive:
> > > every new format requires parallel changes across all three methods,
> > > and there is no single place that describes the full routing policy.
> > >
> > > *Questions for discussion*
> > >
> > > 1. Should Polaris determine the provider first (via metadata) and
> > > delegate to a single matching catalog, or should it attempt multiple
> > > sub-catalogs in a defined order?
> > >
> > > 2. If multiple sub-catalogs are supported, should there be a
> > > documented, deterministic resolution order (Iceberg -> Paimon -> Delta
> > > -> Hudi -> Polaris fallback)? Who owns that order, and should it be
> > > configurable by operators?
> > >
> > > 3. Should the per-format routing logic be centralised behind an
> > > abstraction (e.g. a SubCatalogRouter interface or a provider registry),
> > > so that adding a new format is a single registration rather than edits
> > > across loadTable, alterTable, and dropTable?
> > >
> > > 4. Consistency: Should all table operations (loadTable, createTable,
> > > alterTable, dropTable, renameTable) follow the same routing strategy,
> > > or are per-operation differences acceptable? Currently createTable has
> > > a different branching structure from loadTable.
> > >
> > > 5. Is it in scope for Polaris to act as a routing layer for multiple
> > > table providers, or should users who need both Polaris and Paimon
> > > configure them as separate catalogs in their Spark session and route at
> > > the session level themselves?
> > >
> > > We have a working Paimon implementation today and would like to avoid
> > > locking in a pattern that becomes hard to extend. Any input on the
> > > design direction, or pointers to prior discussion on this topic, would
> > > be much appreciated.
> > >
> > > Best regards,
> > >
> > > I-Ting
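To make the atomicity concern in I-Ting's dual-registration question concrete, here is a minimal sketch of the compensating-action approach. The `PaimonWarehouse` and `PolarisGenericTables` interfaces are hypothetical stand-ins (the real Paimon and Polaris client APIs differ); the point is only the ordering and rollback: create physically first, register logically second, and drop the physical table if registration fails.

```java
// Hypothetical stand-ins for the two systems; the real Paimon and Polaris
// client APIs differ -- this only illustrates ordering and compensation.
interface PaimonWarehouse {
    void createPhysicalTable(String name);
    void dropPhysicalTable(String name);
}

interface PolarisGenericTables {
    void register(String name, String provider);
}

class DualRegistration {
    private final PaimonWarehouse paimon;
    private final PolarisGenericTables polaris;

    DualRegistration(PaimonWarehouse paimon, PolarisGenericTables polaris) {
        this.paimon = paimon;
        this.polaris = polaris;
    }

    // Physical creation first, then logical registration in Polaris.
    // If registration fails, compensate by dropping the physical table
    // so neither system is left with a visible half-created table.
    void createTable(String name) {
        paimon.createPhysicalTable(name);
        try {
            polaris.register(name, "paimon");
        } catch (RuntimeException registrationFailure) {
            try {
                paimon.dropPhysicalTable(name); // best-effort compensation
            } catch (RuntimeException cleanupFailure) {
                registrationFailure.addSuppressed(cleanupFailure);
            }
            throw registrationFailure;
        }
    }
}
```

Even with compensation this is not atomic: a crash between the two calls, or a failed compensation, can still leave an orphaned physical table, which is why such a pattern is usually paired with some out-of-band reconciliation or cleanup.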
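Question 3's provider-registry idea could look roughly like the sketch below. All names here (`SubCatalogRouter`, `providerLookup`, `route`) are illustrative assumptions, not the actual Polaris SparkCatalog API, and the delegate type is left generic (in real use it would be Spark's TableCatalog). A single metadata lookup resolves the provider string, the registry maps it to one delegate, and unknown providers fall through to a fallback, so loadTable/alterTable/dropTable can all share one routing policy and a new format is one `register` call.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Illustrative sketch only: names are assumptions, not Polaris code.
// One metadata lookup resolves the provider; the registry then
// dispatches deterministically instead of probing sub-catalogs.
class SubCatalogRouter<C> {
    private final Function<String, Optional<String>> providerLookup; // e.g. one HTTP GET to Polaris
    private final Map<String, C> delegates = new HashMap<>();
    private final C fallback; // e.g. polarisSparkCatalog for unknown providers

    SubCatalogRouter(Function<String, Optional<String>> providerLookup, C fallback) {
        this.providerLookup = providerLookup;
        this.fallback = fallback;
    }

    // Adding a new format is a single registration, not parallel edits
    // across loadTable, alterTable, and dropTable.
    SubCatalogRouter<C> register(String provider, C delegate) {
        delegates.put(provider.toLowerCase(), delegate);
        return this;
    }

    // Lookup-then-dispatch: resolve the provider once, then route.
    C route(String identifier) {
        return providerLookup.apply(identifier)
                .map(p -> delegates.getOrDefault(p.toLowerCase(), fallback))
                .orElse(fallback);
    }
}
```

One design consequence worth noting: because the routing table is data rather than control flow, the deterministic resolution order from question 2 becomes a property of the registry's contents, which operators could in principle configure.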
