freesinger commented on issue #9706:
URL: https://github.com/apache/gravitino/issues/9706#issuecomment-4287082852
I think there are actually two different categories of external Lance tables
here:
- Tables that are **created and evolved only via Gravitino** (createTable
plus alter): the catalog already knows the location and has a stable way to
access storage, so implementing refresh-read / refresh-and-commit is relatively
straightforward.
- Tables that are **only registered** via registerTable (or createEmptyTable
in older versions): Gravitino knows only the location and may not have any
credentials or storage options to open the dataset, so we cannot assume that
we can always fetch the “real” schema from storage.
In my opinion, instead of trying to guarantee that “all out-of-band
changes are always reflected in Gravitino”, it might be more realistic to:
1. Introduce a **pluggable credential / storage-options provider** for the
Lance REST and generic lakehouse Lance catalog. Given a table location (and
catalog config), this provider returns the storage_options needed to open the
dataset (e.g. via env-based credentials, cloud roles, STS tokens or any
deployment-specific mechanism). If the provider cannot resolve credentials, we
fall back to strict-catalog mode for that table.
2. Explicitly mark which external tables are **refreshable**. For example,
only tables that either:
- are created by Gravitino itself with proper storage configuration, or
- are registered with additional properties that the credential provider
can use,
can opt into refresh-read / refresh-and-commit. Other tables remain in
strict-catalog mode and are documented as “not refreshable”.
3. Treat schema sync for out-of-band changes as **best-effort lazy
reconciliation** rather than a strong guarantee. Users can either:
- request a refresh on read (e.g. via a request-level flag and
refresh-and-commit), or
- run an offline reconciliation job that scans external tables and updates
versions in Gravitino.
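
As a rough illustration of item 1, a provider chain could look like the Python sketch below. Everything here is hypothetical: `StorageOptionsProvider`, `PropertyProvider`, `EnvProvider`, `resolve_storage_options` and the `lance.storage.` property prefix are names I made up for this comment, not existing Gravitino or Lance APIs, and the option keys are only examples. The idea is that each provider returns either a `storage_options` dict or `None`, and `None` from every provider means we stay in strict-catalog.

```python
from typing import Dict, Mapping, Optional, Protocol, Sequence


class StorageOptionsProvider(Protocol):
    """Hypothetical plug-in interface; not an existing Gravitino API."""

    def resolve(
        self, location: str, table_properties: Mapping[str, str]
    ) -> Optional[Dict[str, str]]:
        ...


class PropertyProvider:
    """Uses options registered with the table under an assumed key prefix."""

    PREFIX = "lance.storage."  # assumed prefix, for illustration only

    def resolve(self, location, table_properties):
        opts = {
            key[len(self.PREFIX):]: value
            for key, value in table_properties.items()
            if key.startswith(self.PREFIX)
        }
        # An empty dict means this provider has nothing to offer.
        return opts or None


class EnvProvider:
    """Builds options from environment-style variables for s3:// locations."""

    def __init__(self, env: Mapping[str, str]):
        self._env = env

    def resolve(self, location, table_properties):
        if not location.startswith("s3://"):
            return None
        key = self._env.get("AWS_ACCESS_KEY_ID")
        secret = self._env.get("AWS_SECRET_ACCESS_KEY")
        if not key or not secret:
            return None
        # Example option keys; the real set depends on the storage backend.
        return {"aws_access_key_id": key, "aws_secret_access_key": secret}


def resolve_storage_options(
    providers: Sequence[StorageOptionsProvider],
    location: str,
    table_properties: Mapping[str, str],
) -> Optional[Dict[str, str]]:
    """Return the first non-None result; None means 'use strict-catalog'."""
    for provider in providers:
        opts = provider.resolve(location, table_properties)
        if opts is not None:
            return opts
    return None
```

In a deployment, `EnvProvider(os.environ)` would be one entry in the chain, next to whatever role-based or STS-based providers the environment needs. Putting the table-property provider first lets registered tables carry their own options.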
With this approach, a configuration like “refresh-and-commit for all
external tables” would really mean “for all external tables that are explicitly
marked as refreshable and for which the credential provider can open the
dataset”; for every other table we keep the current strict-catalog behavior.
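
The rule in the paragraph above can be written down as a small decision function. This is only a sketch under assumptions: the `refreshable` table property, the `Mode` enum and `effective_mode` are hypothetical names, and `resolve_options` stands in for whatever credential lookup the catalog ends up using (it returns `None` when no credentials can be resolved).

```python
from enum import Enum
from typing import Callable, Dict, Mapping, Optional


class Mode(Enum):
    STRICT_CATALOG = "strict-catalog"
    REFRESH_READ = "refresh-read"
    REFRESH_AND_COMMIT = "refresh-and-commit"


REFRESHABLE_KEY = "refreshable"  # hypothetical table property name

# Hypothetical lookup signature: (location, table_properties) -> options or None.
ResolveFn = Callable[[str, Mapping[str, str]], Optional[Dict[str, str]]]


def effective_mode(
    configured: Mode,
    location: str,
    table_properties: Mapping[str, str],
    resolve_options: ResolveFn,
) -> Mode:
    """Downgrade to strict-catalog unless the table opted in AND is openable."""
    if configured is Mode.STRICT_CATALOG:
        return Mode.STRICT_CATALOG
    # The table must be explicitly marked as refreshable.
    if table_properties.get(REFRESHABLE_KEY, "false").lower() != "true":
        return Mode.STRICT_CATALOG
    # The credential lookup must be able to produce storage options.
    if resolve_options(location, table_properties) is None:
        return Mode.STRICT_CATALOG
    return configured
```

With this shape, a catalog-wide setting of refresh-and-commit never fails a read just because one registered table has no usable credentials; that table simply behaves as it does today.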