freesinger commented on issue #9706:
URL: https://github.com/apache/gravitino/issues/9706#issuecomment-4287082852

   I think there are actually two different categories of external Lance tables 
here:
   
   - Tables that are **created and evolved only via Gravitino** (createTable 
plus alter): the catalog already knows the location and has a stable way to 
access storage, so implementing refresh-read / refresh-and-commit is relatively 
straightforward.
   - Tables that are **only registered** via registerTable (or createEmptyTable 
in older versions): Gravitino knows only the location and may not have any 
credentials or storage options to open the dataset, so we cannot safely assume 
we can always fetch the “real” schema from storage.
   
   In my opinion, instead of trying to guarantee that “all out-of-band 
changes are always reflected in Gravitino”, it might be more realistic to:
   
   1. Introduce a **pluggable credential / storage-options provider** for the 
Lance REST and generic lakehouse Lance catalog. Given a table location (and 
catalog config), this provider returns the storage_options needed to open the 
dataset (e.g. via env-based credentials, cloud roles, STS tokens or any 
deployment-specific mechanism). If the provider cannot resolve credentials, we 
fall back to strict-catalog mode.
    
   2. Explicitly mark which external tables are **refreshable**. For example, 
only tables that either:
     - are created by Gravitino itself with proper storage configuration, or
     - are registered with additional properties that the credential provider 
can use,
   can opt into refresh-read / refresh-and-commit. Other tables remain in 
strict-catalog mode and are documented as “not refreshable”.
    
   3. Treat schema sync for out-of-band changes as **best-effort lazy 
reconciliation** rather than a strong guarantee. Users can either:
     - request a refresh on read (e.g. via a request-level flag and 
refresh-and-commit), or
     - run an offline reconciliation job that scans external tables and updates 
versions in Gravitino.
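
To make the three steps above more concrete, here is a minimal Python sketch of how a provider, an opt-in marker, and a best-effort reconciliation pass could fit together. Everything in it is an assumption for illustration: `StorageOptionsProvider`, `PrefixProvider`, `ExternalTable`, the `refreshable` property name, and `reconcile` are hypothetical names, not existing Gravitino or Lance APIs, and reading the latest version from storage is left to an injected callable.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional, Protocol


class StorageOptionsProvider(Protocol):
    """Hypothetical plug-in point: maps a table location to storage options."""

    def resolve(
        self, location: str, catalog_config: Dict[str, str]
    ) -> Optional[Dict[str, str]]:
        ...


class PrefixProvider:
    """Toy provider that only knows options for configured location prefixes."""

    def __init__(self, options_by_prefix: Dict[str, Dict[str, str]]) -> None:
        self._options_by_prefix = options_by_prefix

    def resolve(
        self, location: str, catalog_config: Dict[str, str]
    ) -> Optional[Dict[str, str]]:
        for prefix, options in self._options_by_prefix.items():
            if location.startswith(prefix):
                # Catalog-level settings act as defaults; prefix options win.
                return {**catalog_config, **options}
        return None  # unresolved: the caller stays in strict-catalog mode


@dataclass
class ExternalTable:
    name: str
    location: str
    catalog_version: int
    properties: Dict[str, str] = field(default_factory=dict)

    @property
    def refreshable(self) -> bool:
        # "refreshable" is an assumed property name for the explicit opt-in.
        return self.properties.get("refreshable", "false").lower() == "true"


def reconcile(
    tables: Iterable[ExternalTable],
    provider: StorageOptionsProvider,
    catalog_config: Dict[str, str],
    read_latest_version: Callable[[str, Dict[str, str]], int],
) -> List[str]:
    """Best-effort pass: bump catalog versions only where storage is readable."""
    updated = []
    for table in tables:
        if not table.refreshable:
            continue  # never opted in: leave the catalog entry untouched
        options = provider.resolve(table.location, catalog_config)
        if options is None:
            continue  # no credentials: strict-catalog behavior applies
        try:
            latest = read_latest_version(table.location, options)
        except Exception:
            continue  # best effort: one unreadable dataset must not fail the job
        if latest > table.catalog_version:
            table.catalog_version = latest
            updated.append(table.name)
    return updated
```

In this sketch a table is skipped, rather than failing the job, whenever it is not marked refreshable, the provider returns `None`, or the dataset cannot be read, which is what "best-effort" is meant to convey here.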
   
   With this approach, a configuration like “refresh-and-commit for all 
external tables” would really mean “for all external tables that are explicitly 
marked as refreshable and for which the credential provider can open the 
dataset”; all other tables keep the current strict-catalog behavior.
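
The rule in the previous paragraph can be written as a small decision function. This is only a sketch: the mode strings are the ones used in this discussion, while `effective_mode` and its parameters are hypothetical names, not an existing Gravitino API.

```python
from typing import Dict, Optional

STRICT_CATALOG = "strict-catalog"
REFRESH_READ = "refresh-read"
REFRESH_AND_COMMIT = "refresh-and-commit"


def effective_mode(
    configured_mode: str,
    marked_refreshable: bool,
    storage_options: Optional[Dict[str, str]],
) -> str:
    """Return the mode actually applied to one external table."""
    if configured_mode == STRICT_CATALOG:
        return STRICT_CATALOG
    # A refresh mode only applies when the table opted in AND the
    # credential provider produced options that can open the dataset.
    if not marked_refreshable or storage_options is None:
        return STRICT_CATALOG
    return configured_mode
```

For example, a catalog configured with refresh-and-commit would still treat a registered table without resolvable credentials as strict-catalog, because `storage_options` would be `None` for it.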


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
