Re: [DISCUSS] Generic table delegation strategy in Polaris SparkCatalog

ITing Lee Thu, 19 Mar 2026 03:29:35 -0700

Hi Yun,

Thank you for the detailed feedback!



1. Routing Abstraction: I agree it's the right time to move beyond if/else.
I will refactor the current logic (which explicitly checks for
Paimon/Delta/Hudi in each method) into a centralized routing abstraction
(maybe TableOperationsRouter). This will decouple SparkCatalog from
specific formats and ensure a consistent, extensible strategy across all
table operations.
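
As a rough illustration of the routing idea, here is a minimal sketch. It is not the code in the PR: the FormatHandler interface, the register/route methods, and the string-based loadTable signature are all hypothetical names chosen for the example, and only the TableOperationsRouter name comes from the proposal above.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-format handler; NOT an existing Polaris or Spark interface.
interface FormatHandler {
  String loadTable(String ident);
}

// Hypothetical router: adding a format is one register() call instead of
// parallel if/else edits in loadTable, alterTable, and dropTable.
final class TableOperationsRouter {
  private final Map<String, FormatHandler> handlers = new ConcurrentHashMap<>();
  private final FormatHandler fallback;

  TableOperationsRouter(FormatHandler fallback) {
    this.fallback = fallback;
  }

  void register(String provider, FormatHandler handler) {
    // Provider strings are normalised so "Paimon" and "paimon" match.
    handlers.put(provider.toLowerCase(Locale.ROOT), handler);
  }

  FormatHandler route(String provider) {
    // Unknown or missing providers use the fallback handler.
    if (provider == null) {
      return fallback;
    }
    return handlers.getOrDefault(provider.toLowerCase(Locale.ROOT), fallback);
  }
}
```

With something like this, the catalog methods would each call route(provider) once and delegate, so the routing policy would live in a single place.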


2. Paimon Transaction / Dual-Registration: Regarding Paimon’s
non-delegating nature, I have already implemented the client-side
responsibility pattern you mentioned. In my current PR, the createTable
logic handles the physical creation first, followed by Polaris
registration. Crucially, I’ve included rollbackPaimonTableCreation in the
catch block to ensure atomicity by dropping the physical table if
registration fails.
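
The create-then-register flow with rollback can be sketched roughly as follows. This is a simplified illustration, not the PR code: PhysicalStore, Registry, and DualRegistration are hypothetical stand-ins for the Paimon warehouse and the Polaris registration call, and the rollback shown is best-effort only (if the drop itself fails, the two systems can still diverge).

```java
// Hypothetical stand-in for the Paimon warehouse; not a Paimon API.
interface PhysicalStore {
  void create(String ident);
  void drop(String ident);
}

// Hypothetical stand-in for Polaris generic table registration.
interface Registry {
  void register(String ident);
}

final class DualRegistration {
  private DualRegistration() {}

  // Physical creation first, then logical registration; roll back on failure.
  static void createTable(String ident, PhysicalStore store, Registry registry) {
    store.create(ident);
    try {
      registry.register(ident);
    } catch (RuntimeException registrationFailure) {
      try {
        // Best-effort rollback: remove the physical table just created.
        store.drop(ident);
      } catch (RuntimeException rollbackFailure) {
        // Keep the original error and attach the rollback error to it.
        registrationFailure.addSuppressed(rollbackFailure);
      }
      throw registrationFailure;
    }
  }
}
```

The caller always sees the registration error, and a failed rollback is attached as a suppressed exception so that it is not silently lost.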


I’ve noted this topic is on the March 19 Community Sync agenda. I will
follow the meeting notes to see if there is any further consensus on this
dual-registration pattern before finalizing the failure handling logic.


I look forward to the insights from the sync.


Best regards,

I-Ting

On Thu, Mar 19, 2026 at 4:22 PM ITing Lee <[email protected]> wrote:

> Hi Dmitri,
>
> Thanks for adding this to the Community Sync agenda and for keeping me in
> the loop.
>
> Since the meeting time is around midnight in my time zone, I won't be able
> to join live.
> Could you please confirm where I can find the meeting outcomes? Should I
> check the community Google Doc for the notes, or will there be a recording
> available?
>
> I look forward to the community's feedback from the sync. I'll follow up
> on the mailing list or the PR once I’ve had a chance to process the
> meeting's outcomes.
>
> Best regards,
>
> I-Ting
>
> On Wed, Mar 18, 2026 at 8:12 AM yun zou <[email protected]> wrote:
>
>> Hi ITing,
>>
>> Thanks for bringing this up!
>>
>> *>>> Should Polaris determine the provider first (via metadata) and
>> delegate to a single matching catalog, or should it attempt multiple
>> sub-catalogs in a defined order? *
>>
>> *>>> If multiple sub-catalogs are supported, should there be a documented,
>> deterministic resolution order?*
>>
>> As Dmitri pointed out, Polaris Catalog today is designed to support mixed
>> table types. In other words, a single catalog (and namespace) can contain
>> Iceberg, Delta, and Hudi tables, and table identifiers must be unique
>> across all of them.
>>
>> Currently:
>>
>>    - Iceberg tables are only visible through Iceberg endpoints
>>    - Generic tables are only visible through generic table endpoints
>>    - These two views are disjoint
>>
>> Because of this, to get a complete view of all tables in a catalog, we
>> need to call listTables on both the Iceberg and generic endpoints.
>>
>> For loadTable, since we only have the table identifier and don’t know the
>> table type upfront, we may need to try both endpoints in the worst case.
>> Client-side table format caching could help optimize this in the near
>> future.
>>
>> Regarding ordering, there isn’t a strict or required sequence when
>> checking different table types. For example, checking Generic first and
>> then Iceberg (or vice versa) won’t change the outcome. The current
>> approach of attempting Iceberg first is simply a convention, not a
>> requirement.
>>
>> *>>>  Should the per-format routing logic be centralised behind an
>> abstraction (e.g. a SubCatalogRouter interface or a provider registry), so
>> that adding a new format is a single registration rather than edits across
>> loadTable, alterTable, and dropTable? *
>>
>> I think the current if/else logic mainly exists because we didn’t have a
>> clear understanding of how different formats would behave on the client
>> side at the time. Now that Delta, Hudi, and Lance appear to follow a
>> similar pattern, it makes sense to extract a common routing abstraction.
>> That would definitely simplify the code and make adding new formats a
>> matter of registration rather than touching multiple code paths.
>>
>> *>>> Consistency: Should all table operations (loadTable, createTable,
>> alterTable, dropTable, renameTable) follow the same routing strategy, or
>> are per-operation differences acceptable? Currently createTable has a
>> different branching structure from loadTable.*
>>
>> In general, it would be good for most table operations (loadTable,
>> alterTable, dropTable, renameTable) to follow a consistent routing
>> strategy. However, createTable is a bit different — since we already know
>> the table format at creation time, we can directly route to the correct
>> endpoint. So I think it’s reasonable for createTable to have a different
>> branching structure.
>>
>> *>>> Is it in scope for Polaris to act as a routing layer for multiple
>> table providers, or should users who need both Polaris and Paimon
>> configure them as separate catalogs in their Spark session and route at
>> the session level themselves?*
>>
>> Polaris Server itself doesn’t perform routing. This responsibility lies
>> with the Polaris Spark Client, which should determine the correct endpoint
>> to call for each operation.
>>
>> *>>> Paimon does not support a delegating catalog mode (unlike
>> Delta/Hudi), it cannot automatically notify Polaris of its changes.*
>>
>> I may have missed this detail in the PR and will double-check. My
>> understanding is that Paimon’s SparkCatalog does not call into a REST
>> catalog as part of its table operations. In that case, it becomes the
>> client’s responsibility to ensure operations are executed correctly. If
>> needed, we could invoke operations twice, but we’d also need to ensure
>> proper failure handling — i.e., if any step fails, the operation should be
>> marked as failed and the transaction rolled back correctly.
>>
>>
>> Best Regards,
>>
>> Yun
>>
>> On Tue, Mar 17, 2026 at 7:45 AM Dmitri Bourlatchkov <[email protected]>
>> wrote:
>>
>> > Hi I-Ting,
>> >
>> > Unfortunately, I do not have an answer to your double registration
>> > question off the top of my head, but I added an item for this
>> > discussion to the Community Sync [1] agenda for March 19.
>> >
>> > [1] https://polaris.apache.org/community/meetings/
>> >
>> > Cheers,
>> > Dmitri.
>> >
>> > On Tue, Mar 17, 2026 at 10:19 AM ITing Lee <[email protected]> wrote:
>> >
>> > > Hi Dmitri,
>> > >
>> > > Thank you for your clear guidance!
>> > >
>> > >
>> > > I completely agree with the unified namespace tree principle.
>> > >
>> > > To ensure Polaris acts as the single source of truth and avoids
>> > > resolution ambiguity, I will refactor the implementation to follow a
>> > > lookup-then-dispatch pattern.
>> > >
>> > > Instead of speculative probing, the SparkCatalog will first resolve
>> > > the table entity via Polaris metadata to identify the provider, then
>> > > deterministically route the call or throw a "Table format mismatch"
>> > > error if the API mode is incompatible.
>> > >
>> > >
>> > > I have another question regarding table registration for
>> > > non-delegating formats.
>> > >
>> > > Since Paimon does not support a delegating catalog mode (unlike
>> > > Delta/Hudi), it cannot automatically notify Polaris of its changes.
>> > >
>> > > In my PR, I've implemented an explicit dual-registration during
>> > > createTable (physical creation in the Paimon warehouse followed by
>> > > logical registration in Polaris).
>> > >
>> > > This ensures Paimon tables are visible via SHOW TABLES.
>> > >
>> > >
>> > > I would like to ask whether the community has better ideas for
>> > > handling such standalone formats. (From my perspective, the
>> > > dual-registration is not an atomic operation across both systems.
>> > > There is still a chance that one service succeeds while the other
>> > > fails, which would cause inconsistency. However, it _seems_ this is
>> > > the only way to achieve it for non-delegating formats.)
>> > >
>> > >
>> > > The alternative, having Polaris actively scan external warehouses,
>> > > seems to introduce significant performance overhead.
>> > >
>> > > Is there a more elegant way to ensure catalog visibility without
>> > > sacrificing the goal of a single source of truth, or is this explicit
>> > > registration the preferred pattern for now?
>> > >
>> > >
>> > > Best regards,
>> > >
>> > > I-Ting
>> > >
>> > > On Mon, Mar 16, 2026 at 9:42 PM Dmitri Bourlatchkov <[email protected]>
>> > > wrote:
>> > >
>> > > > Hi I-Ting,
>> > > >
>> > > > Thanks for starting this discussion. You bring up important points.
>> > > >
>> > > > From my point of view, the catalog data controlled by Polaris
>> > > > should form a unified namespace tree. In other words, each full
>> > > > table name owned by Polaris must be unique and resolve to the same
>> > > > table entity regardless of the API used by the client.
>> > > >
>> > > > If a name is accessed via the Iceberg REST Catalog API and happens
>> > > > to point to a Paimon table, I think Polaris ought to report an
>> > > > error to the client (something like HTTP 422 "Table format
>> > > > mismatch").
>> > > >
>> > > > If a name is accessed via the Generic Tables API, the response must
>> > > > indicate the actual table format.
>> > > >
>> > > > I do not think the client should make multiple "lookup" calls for
>> > > > the same table name. That creates ambiguity in the name resolution
>> > > > logic and could lead to different lookup results in different
>> > > > clients.
>> > > >
>> > > > I believe the client should select the API it wants to use (IRC or
>> > > > Generic Tables) at setup time and then rely on that API for all
>> > > > primary lookup calls.
>> > > >
>> > > > WDYT?
>> > > >
>> > > > Thanks,
>> > > > Dmitri.
>> > > >
>> > > > On Sat, Mar 14, 2026 at 3:34 AM 李宜頲 <[email protected]> wrote:
>> > > >
>> > > > > Hi all,
>> > > > >
>> > > > > We are adding support for Paimon inside Polaris's SparkCatalog.
>> > > > > Before we add more formats, we would like to get community input
>> > > > > on the intended architecture.
>> > > > >
>> > > > > This discussion originated from a code review conversation in PR
>> > > > > #3820
>> > > > > <https://github.com/apache/polaris/pull/3820#discussion_r2865885791>
>> > > > >
>> > > > >
>> > > > >
>> > > > > *Current design*
>> > > > >
>> > > > > When SparkCatalog.loadTable is called, the routing works in three
>> > > > > phases:
>> > > > >
>> > > > > 1. Try the Iceberg catalog (icebergSparkCatalog.loadTable). If it
>> > > > > succeeds, return immediately.
>> > > > >
>> > > > > 2. Call getTableFormat(ident), which makes a single HTTP GET to
>> > > > > the Polaris server to read the provider property stored in the
>> > > > > generic table metadata, without triggering any Spark DataSource
>> > > > > resolution.
>> > > > >
>> > > > > 3. Route based on the provider string:
>> > > > >
>> > > > >     - "paimon"  : delegate to Paimon's SparkCatalog
>> > > > >
>> > > > >     - unknown/other : fall back to polarisSparkCatalog.loadTable,
>> > > > >       which performs full DataSource resolution
>> > > > >
>> > > > >
>> > > > > The same three-phase pattern is repeated independently in
>> > > > > loadTable, alterTable, and dropTable *(createTable does not follow
>> > > > > this pattern)*. Our concern is that this makes the routing logic
>> > > > > intrusive: every new format requires parallel changes across all
>> > > > > three methods, and there is no single place that describes the
>> > > > > full routing policy.
>> > > > >
>> > > > >
>> > > > > *Questions for discussion*
>> > > > >
>> > > > >
>> > > > > 1. Should Polaris determine the provider first (via metadata) and
>> > > > > delegate to a single matching catalog, or should it attempt
>> > > > > multiple sub-catalogs in a defined order?
>> > > > >
>> > > > > 2. If multiple sub-catalogs are supported, should there be a
>> > > > > documented, deterministic resolution order (Iceberg -> Paimon ->
>> > > > > Delta -> Hudi -> Polaris fallback)? Who owns that order, and
>> > > > > should it be configurable by operators?
>> > > > >
>> > > > > 3. Should the per-format routing logic be centralised behind an
>> > > > > abstraction (e.g. a SubCatalogRouter interface or a provider
>> > > > > registry), so that adding a new format is a single registration
>> > > > > rather than edits across loadTable, alterTable, and dropTable?
>> > > > >
>> > > > > 4. Consistency: Should all table operations (loadTable,
>> > > > > createTable, alterTable, dropTable, renameTable) follow the same
>> > > > > routing strategy, or are per-operation differences acceptable?
>> > > > > Currently createTable has a different branching structure from
>> > > > > loadTable.
>> > > > >
>> > > > > 5. Is it in scope for Polaris to act as a routing layer for
>> > > > > multiple table providers, or should users who need both Polaris
>> > > > > and Paimon configure them as separate catalogs in their Spark
>> > > > > session and route at the session level themselves?
>> > > > >
>> > > > >
>> > > > > We have a working Paimon implementation today and would like to
>> > > > > avoid locking in a pattern that becomes hard to extend. Any input
>> > > > > on the design direction, or pointers to prior discussion on this
>> > > > > topic, would be much appreciated.
>> > > > >
>> > > > >
>> > > > > Best regards,
>> > > > >
>> > > > > I-Ting
>> > > > >
>> > > >
>> > >
>> >
>>
>
