Re: [DISCUSS] Generic table delegation strategy in Polaris SparkCatalog

ITing Lee Thu, 19 Mar 2026 01:23:18 -0700

Hi Dmitri,

Thanks for adding this to the Community Sync agenda and for keeping me in
the loop.


Since the meeting time is around midnight in my time zone, I won't be able
to join live.
Could you please confirm where I can find the meeting outcomes? Should I
check the community Google Doc for the notes, or will there be a recording
available?

I look forward to the community's feedback from the sync. I'll follow up on
the mailing list or the PR once I’ve had a chance to process the meeting's
outcomes.

Best regards,

I-Ting

On Wed, Mar 18, 2026 at 8:12 AM yun zou <[email protected]> wrote:

> Hi ITing,
>
> Thanks for bringing this up!
>
> *>>> Should Polaris determine the provider first (via metadata) and
> delegate to a single matching catalog, or should it attempt multiple
> sub-catalogs in a defined order? *
>
> *>>> If multiple sub-catalogs are supported, should there be a documented,
> deterministic resolution order?*
>
> As Dmitri pointed out, Polaris Catalog today is designed to support mixed
> table types. In other words, a single catalog (and namespace) can contain
> Iceberg, Delta, and Hudi tables, and table identifiers must be unique
> across all of them.
>
> Currently:
>
>    - Iceberg tables are only visible through Iceberg endpoints
>    - Generic tables are only visible through generic table endpoints
>    - These two views are disjoint
>
> Because of this, to get a complete view of all tables in a catalog, we need
> to call listTables on both the Iceberg and generic endpoints.
>
> For loadTable, since we only have the table identifier and don’t know the
> table type upfront, we may need to try both endpoints in the worst case.
> Client-side table format caching could help optimize this in the near future.
>
> Regarding ordering, there isn’t a strict or required sequence when checking
> different table types. For example, checking Generic first and then Iceberg
> (or vice versa) won’t change the outcome. The current approach of
> attempting Iceberg first is simply a convention, not a requirement.
>
> *>>>  Should the per-format routing logic be centralised behind an
> abstraction (e.g. a SubCatalogRouter interface or a provider registry), so
> that adding a new format is a single registration rather than edits across
> loadTable, alterTable, and dropTable? *
>
> I think the current if/else logic mainly exists because we didn’t have a
> clear understanding of how different formats would behave on the client
> side at the time. Now that Delta, Hudi, and Lance appear to follow a
> similar pattern, it makes sense to extract a common routing abstraction.
> That would definitely simplify the code and make adding new formats a
> matter of registration rather than touching multiple code paths.
>
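A minimal sketch of the registry idea discussed above, written in Python for brevity (the Polaris Spark client itself is Java). Every name here (`SubCatalogRouter`, `register`, `route`, `StubCatalog`) is hypothetical and not part of Polaris; the sketch only shows how one lookup table could replace per-method if/else branches.

```python
from typing import Dict, Optional


class StubCatalog:
    """Hypothetical stand-in for a format-specific Spark catalog."""

    def __init__(self, name: str) -> None:
        self.name = name

    def load_table(self, ident: str) -> str:
        return f"{self.name}:{ident}"


class SubCatalogRouter:
    """Hypothetical provider registry: provider string -> sub-catalog."""

    def __init__(self, fallback: StubCatalog) -> None:
        self._fallback = fallback            # used when no provider matches
        self._catalogs: Dict[str, StubCatalog] = {}

    def register(self, provider: str, catalog: StubCatalog) -> None:
        # Adding a format is one call here, not edits in every operation.
        self._catalogs[provider.lower()] = catalog

    def route(self, provider: Optional[str]) -> StubCatalog:
        if provider is None:
            return self._fallback
        return self._catalogs.get(provider.lower(), self._fallback)


router = SubCatalogRouter(fallback=StubCatalog("polaris"))
router.register("paimon", StubCatalog("paimon"))
router.register("delta", StubCatalog("delta"))
```

With such a registry, loadTable, alterTable, and dropTable would each call `router.route(provider)` and forward the operation, so the routing policy lives in one place.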
> *>>> Consistency: Should all table operations (loadTable, createTable,
> alterTable, dropTable, renameTable) follow the same routing strategy, or
> are per-operation differences acceptable? Currently createTable has a
> different branching structure from loadTable.*
>
> In general, it would be good for most table operations (loadTable,
> alterTable, dropTable, renameTable) to follow a consistent routing
> strategy. However, createTable is a bit different — since we already know
> the table format at creation time, we can directly route to the correct
> endpoint. So I think it’s reasonable for createTable to have a different
> branching structure.
>
> *>>> Is it in scope for Polaris to act as a routing layer for multiple
> table providers, or should users who need both Polaris and Paimon configure
> them as separate catalogs in their Spark session and route at the session
> level themselves?*
>
> Polaris Server itself doesn’t perform routing. This responsibility lies
> with the Polaris Spark Client, which should determine the correct endpoint
> to call for each operation.
>
> *>>> Since Paimon does not support a delegating catalog mode (unlike
> Delta/Hudi), it cannot automatically notify Polaris of its changes.*
>
> I may have missed this detail in the PR and will double-check. My
> understanding is that Paimon’s SparkCatalog does not call into a REST
> catalog as part of its table operations. In that case, it becomes the
> client’s responsibility to ensure operations are executed correctly. If
> needed, we could invoke operations twice, but we’d also need to ensure
> proper failure handling — i.e., if any step fails, the operation should be
> marked as failed and the transaction rolled back correctly.
>
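A minimal sketch of the failure handling described above, in Python for brevity. The three callables are hypothetical stand-ins (physical creation in the Paimon warehouse, logical registration in Polaris, and dropping the Paimon table again); nothing here is an existing Polaris or Paimon API.

```python
class RegistrationError(Exception):
    """Raised when the two-step create does not fully succeed."""


def create_with_dual_registration(create_physical, register_logical, drop_physical):
    """Run both steps; undo the physical step if the logical one fails."""
    create_physical()                  # step 1: a failure here leaves nothing behind
    try:
        register_logical()             # step 2: make the table visible in Polaris
    except Exception as cause:
        try:
            drop_physical()            # best-effort compensation for step 1
        except Exception as cleanup_error:
            raise RegistrationError(
                "registration failed and cleanup failed; manual repair needed"
            ) from cleanup_error
        raise RegistrationError(
            "registration failed; physical table dropped"
        ) from cause
```

This is compensation rather than an atomic commit: if the client crashes between the two steps, the physical table still exists without a Polaris entry.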
>
> Best Regards,
>
> Yun
>
> On Tue, Mar 17, 2026 at 7:45 AM Dmitri Bourlatchkov <[email protected]>
> wrote:
>
> > Hi I-Ting,
> >
> > Unfortunately, I do not have an answer to your double registration
> > question off the top of my head, but I added an item for this discussion
> > to the Community Sync [1] agenda for March 19.
> >
> > [1] https://polaris.apache.org/community/meetings/
> >
> > Cheers,
> > Dmitri.
> >
> > On Tue, Mar 17, 2026 at 10:19 AM ITing Lee <[email protected]> wrote:
> >
> > > Hi Dmitri,
> > >
> > > Thank you for your clear guidance!
> > >
> > >
> > > I completely agree with the unified namespace tree principle.
> > >
> > > To ensure Polaris acts as the single source of truth and avoids
> > > resolution ambiguity, I will refactor the implementation to follow a
> > > lookup-then-dispatch pattern.
> > >
> > > Instead of speculative probing, the SparkCatalog will first resolve the
> > > table entity via Polaris metadata to identify the provider, then
> > > deterministically route the call or throw a "Table format mismatch"
> > > error if the API mode is incompatible.
> > >
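A minimal sketch of the lookup-then-dispatch pattern described above, in Python for brevity. The function name, its parameters, and both exception classes are hypothetical; `lookup_provider` stands in for whatever single metadata call returns the provider, and `api_formats` for the set of formats the configured API mode can serve.

```python
class TableNotFound(Exception):
    """Hypothetical: the identifier does not resolve to any table entity."""


class TableFormatMismatch(Exception):
    """Hypothetical: the table exists but the current API mode cannot serve it."""


def load_table_lookup_then_dispatch(ident, lookup_provider, catalogs, api_formats):
    """Resolve the provider exactly once, then route deterministically.

    lookup_provider: callable returning the provider string, or None if absent
    catalogs:        dict mapping provider string -> load function
    api_formats:     providers compatible with the configured API mode
    """
    provider = lookup_provider(ident)          # the only lookup call
    if provider is None:
        raise TableNotFound(ident)
    if provider not in api_formats or provider not in catalogs:
        raise TableFormatMismatch(f"{ident} is a {provider} table")
    return catalogs[provider](ident)           # no probing of other catalogs
```

Because the provider is resolved before any sub-catalog is touched, two clients with the same configuration cannot reach different answers for the same name.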
> > >
> > > I have another question regarding table registration for non-delegating
> > > formats.
> > >
> > > Since Paimon does not support a delegating catalog mode (unlike
> > > Delta/Hudi), it cannot automatically notify Polaris of its changes.
> > >
> > > In my PR, I've implemented an explicit dual registration during
> > > createTable (physical creation in the Paimon warehouse followed by
> > > logical registration in Polaris).
> > >
> > > This ensures Paimon tables are visible via SHOW TABLES.
> > >
> > >
> > > I would like to ask whether the community has better ideas for handling
> > > such standalone formats. (From my perspective, the dual registration is
> > > not an atomic operation across both systems. There is still a chance
> > > that one service succeeds while the other fails, which would leave them
> > > inconsistent. However, it _seems_ this is the only way to achieve
> > > visibility for non-delegating formats.)
> > >
> > >
> > > The alternative, having Polaris actively scan external warehouses,
> > > seems to introduce significant performance overhead.
> > >
> > > Is there a more elegant way to ensure catalog visibility without
> > > sacrificing the goal of a single source of truth, or is this explicit
> > > registration the preferred pattern for now?
> > >
> > >
> > > Best regards,
> > >
> > > I-Ting
> > >
> > > On Mon, Mar 16, 2026 at 9:42 PM Dmitri Bourlatchkov <[email protected]> wrote:
> > >
> > > > Hi I-Ting,
> > > >
> > > > Thanks for starting this discussion. You bring up important points.
> > > >
> > > > From my point of view, the catalog data controlled by Polaris should
> > > > form a unified namespace tree. In other words, each full table name
> > > > owned by Polaris must be unique and resolve to the same table entity
> > > > regardless of the API used by the client.
> > > >
> > > > If a name is accessed via the Iceberg REST Catalog API and happens to
> > > > point to a Paimon table, I think Polaris ought to report an error to
> > > > the client (something like HTTP 422 "Table format mismatch").
> > > >
> > > > If a name is accessed via the Generic Tables API, the response must
> > > > indicate the actual table format.
> > > >
> > > > I do not think the client should make multiple "lookup" calls for the
> > > > same table name. That creates ambiguity in the name resolution logic
> > > > and could lead to different lookup results in different clients.
> > > >
> > > > I believe the client should select the API it wants to use (IRC or
> > > > Generic Tables) at setup time and then rely on that API for all
> > > > primary lookup calls.
> > > >
> > > > WDYT?
> > > >
> > > > Thanks,
> > > > Dmitri.
> > > >
> > > > On Sat, Mar 14, 2026 at 3:34 AM 李宜頲 <[email protected]> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > We are adding support for Paimon inside Polaris's SparkCatalog.
> > > > > Before we add more formats, we would like to get community input on
> > > > > the intended architecture.
> > > > >
> > > > > This discussion originated from a code review conversation in PR
> > > > > #3820
> > > > > <https://github.com/apache/polaris/pull/3820#discussion_r2865885791>
> > > > >
> > > > >
> > > > >
> > > > > *Current design*
> > > > >
> > > > > When SparkCatalog.loadTable is called, the routing works in three
> > > > > phases:
> > > > >
> > > > >
> > > > > 1. Try the Iceberg catalog (icebergSparkCatalog.loadTable). If it
> > > > > succeeds, return immediately.
> > > > >
> > > > > 2. Call getTableFormat(ident), which makes a single HTTP GET to the
> > > > > Polaris server to read the provider property stored in the generic
> > > > > table metadata, without triggering any Spark DataSource resolution.
> > > > >
> > > > > 3. Route based on the provider string:
> > > > >
> > > > >     - "paimon"  : delegate to Paimon's SparkCatalog
> > > > >
> > > > >     - unknown/other : fall back to polarisSparkCatalog.loadTable,
> > > > >       which performs full DataSource resolution
> > > > >
> > > > >
> > > > > The same three-phase pattern is repeated independently in loadTable,
> > > > > alterTable, and dropTable *(createTable does not follow this
> > > > > pattern)*. One concern is that this makes the routing logic
> > > > > intrusive: every new format requires parallel changes across all
> > > > > three methods, and there is no single place that describes the full
> > > > > routing policy.
> > > > >
> > > > >
> > > > > *Questions for discussion*
> > > > >
> > > > >
> > > > > 1. Should Polaris determine the provider first (via metadata) and
> > > > > delegate to a single matching catalog, or should it attempt multiple
> > > > > sub-catalogs in a defined order?
> > > > >
> > > > > 2. If multiple sub-catalogs are supported, should there be a
> > > > > documented, deterministic resolution order (Iceberg -> Paimon ->
> > > > > Delta -> Hudi -> Polaris fallback)? Who owns that order, and should
> > > > > it be configurable by operators?
> > > > >
> > > > > 3. Should the per-format routing logic be centralised behind an
> > > > > abstraction (e.g. a SubCatalogRouter interface or a provider
> > > > > registry), so that adding a new format is a single registration
> > > > > rather than edits across loadTable, alterTable, and dropTable?
> > > > >
> > > > > 4. Consistency: Should all table operations (loadTable, createTable,
> > > > > alterTable, dropTable, renameTable) follow the same routing strategy,
> > > > > or are per-operation differences acceptable? Currently createTable
> > > > > has a different branching structure from loadTable.
> > > > >
> > > > > 5. Is it in scope for Polaris to act as a routing layer for multiple
> > > > > table providers, or should users who need both Polaris and Paimon
> > > > > configure them as separate catalogs in their Spark session and route
> > > > > at the session level themselves?
> > > > >
> > > > >
> > > > > We have a working Paimon implementation today and would like to
> > > > > avoid locking in a pattern that becomes hard to extend. Any input on
> > > > > the design direction, or pointers to prior discussion on this topic,
> > > > > would be much appreciated.
> > > > >
> > > > >
> > > > > Best regards,
> > > > >
> > > > > I-Ting
> > > > >
> > > >
> > >
> >
>
