freesinger commented on issue #9706:
URL: https://github.com/apache/gravitino/issues/9706#issuecomment-4286455567
@yuqi1129
I think it would be nice to provide a configurable way to determine how the
Lance schema is updated.
Below is a short note and comparison table on schema refresh for Lance
external tables in Gravitino.

| Aspect | Current Gravitino | "Real-time" in Lance (interpretation) | Option A: Request-level flag | Option B: Service-level config | Trade-offs |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **Schema Source** | Reads from Gravitino's entity store, not the physical dataset. | The Lance format supports retrieving the latest schema by opening the dataset. | A request flag (e.g., `refresh=true`) would trigger a direct read from storage. | A catalog/service-level config would set the default refresh behavior for all reads. | Performance vs. consistency: catalog cache is fast but can be stale; storage reads are accurate but slow. |
| **Version Sync** | `LANCE_TABLE_VERSION` property is persisted by some write operations but not by `describe`. | REST spec's `load_detailed_metadata=true` implies fetching the latest version from storage. | The request could optionally commit the refreshed schema and version back to the entity store. | Configuration could enforce that all reads on external tables refresh and commit the schema. | Request-level gives flexibility; service-level ensures consistency but may have wide performance impact. |
| **API Design** | No explicit parameter exists on `describeTable` to force a storage read. | The Lance REST spec provides a `load_detailed_metadata` flag for this purpose. | A new parameter like `x-gravitino-refresh-schema=true` could be added to the REST API. | The behavior would be implicit, controlled entirely by backend configuration without API changes. | API flags are explicit but add complexity; config is simpler for clients but less flexible. |
| **Recommended Policy** | N/A | N/A | A request parameter could select one of three modes for schema resolution. | A service property would set the default policy for all external tables. | A flexible policy balances performance, consistency, and operational needs for different use cases. |
| **3-Mode Policy** | `strict-catalog`: Always reads from the entity store (current default). | `refresh-read`: Reads from storage for the current request only, without persisting. | `refresh-and-commit`: Reads from storage and updates the entity store with the new schema. | These modes could be set as the default behavior at the service or catalog level. | `strict-catalog` is fast; `refresh-read` is good for ad-hoc checks; `refresh-and-commit` ensures long-term consistency. |
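
To make the three modes concrete, here is a rough, language-agnostic sketch in Python of how a request-level value could override a service-level default. Everything in it is hypothetical and for illustration only: `SchemaPolicy`, `EntityStore`, `resolve_policy`, `describe_table`, and the `read_from_storage` callback are made-up names, not existing Gravitino or Lance APIs, and a schema is modeled as a plain list of column names.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Optional


class SchemaPolicy(Enum):
    """The three proposed modes (names taken from the table above)."""
    STRICT_CATALOG = "strict-catalog"
    REFRESH_READ = "refresh-read"
    REFRESH_AND_COMMIT = "refresh-and-commit"


@dataclass
class EntityStore:
    """Hypothetical in-memory stand-in for Gravitino's entity store."""
    schemas: Dict[str, List[str]] = field(default_factory=dict)

    def get(self, table: str) -> List[str]:
        return self.schemas[table]

    def put(self, table: str, schema: List[str]) -> None:
        self.schemas[table] = schema


def resolve_policy(request_value: Optional[str],
                   service_default: SchemaPolicy) -> SchemaPolicy:
    """Option A overrides Option B: a request value wins over the default."""
    return SchemaPolicy(request_value) if request_value else service_default


def describe_table(table: str,
                   store: EntityStore,
                   read_from_storage: Callable[[str], List[str]],
                   policy: SchemaPolicy) -> List[str]:
    """Return the schema according to the selected policy."""
    if policy is SchemaPolicy.STRICT_CATALOG:
        # Fast path: never touch the physical dataset.
        return store.get(table)
    # Both refresh modes read the latest schema from storage.
    fresh = read_from_storage(table)
    if policy is SchemaPolicy.REFRESH_AND_COMMIT:
        # Persist the refreshed schema back to the entity store.
        store.put(table, fresh)
    return fresh
```

With this shape, a client that sends no value gets whatever the service or catalog default is, while an ad-hoc `refresh-read` request leaves the entity store untouched.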