freesinger commented on issue #9706:
URL: https://github.com/apache/gravitino/issues/9706#issuecomment-4286455567

   @yuqi1129 
   I think it would be nice to provide a configurable way to determine how the 
Lance schema is updated.
   
   Below is a short note and comparison table on schema refresh for Lance 
external tables in Gravitino.
   
   | Aspect | Current Gravitino | "Real-time" in Lance (interpretation) | Option A: Request-level flag | Option B: Service-level config | Trade-offs |
   | :--- | :--- | :--- | :--- | :--- | :--- |
   | **Schema Source** | Reads from Gravitino's entity store, not the physical dataset. | The Lance format supports retrieving the latest schema by opening the dataset. | A request flag (e.g., `refresh=true`) would trigger a direct read from storage. | A catalog/service-level config would set the default refresh behavior for all reads. | Performance vs. consistency: catalog cache is fast but can be stale; storage reads are accurate but slow. |
   | **Version Sync** | `LANCE_TABLE_VERSION` property is persisted by some write operations but not by `describe`. | REST spec's `load_detailed_metadata=true` implies fetching the latest version from storage. | The request could optionally commit the refreshed schema and version back to the entity store. | Configuration could enforce that all reads on external tables refresh and commit the schema. | Request-level gives flexibility; service-level ensures consistency but may have wide performance impact. |
   | **API Design** | No explicit parameter exists on `describeTable` to force a storage read. | The Lance REST spec provides a `load_detailed_metadata` flag for this purpose. | A new parameter like `x-gravitino-refresh-schema=true` could be added to the REST API. | The behavior would be implicit, controlled entirely by backend configuration without API changes. | API flags are explicit but add complexity; config is simpler for clients but less flexible. |
   | **Recommended Policy** | N/A | N/A | A request parameter could select one of three modes for schema resolution. | A service property would set the default policy for all external tables. | A flexible policy balances performance, consistency, and operational needs for different use cases. |
   | **3-Mode Policy** | `strict-catalog`: Always reads from the entity store (current default). | `refresh-read`: Reads from storage for the current request only, without persisting. | `refresh-and-commit`: Reads from storage and updates the entity store with the new schema. | These modes could be set as the default behavior at the service or catalog level. | `strict-catalog` is fast; `refresh-read` is good for ad-hoc checks; `refresh-and-commit` ensures long-term consistency. |
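
To make the three modes concrete, here is a rough sketch of how a request-level value (Option A) could override a service-level default (Option B). It is written in Python only for brevity and is not Gravitino code: `SchemaPolicy`, `resolve_policy`, `EntityStore`, `describe_table`, and the `read_from_storage` callback are all hypothetical names, and the in-memory dict merely stands in for the entity store.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, Optional


class SchemaPolicy(Enum):
    # Mode names taken from the table above.
    STRICT_CATALOG = "strict-catalog"
    REFRESH_READ = "refresh-read"
    REFRESH_AND_COMMIT = "refresh-and-commit"


def resolve_policy(request_value: Optional[str],
                   service_default: SchemaPolicy) -> SchemaPolicy:
    """Assumed precedence: a request-level value wins over the service default."""
    if request_value is None:
        return service_default
    return SchemaPolicy(request_value)  # raises ValueError for unknown modes


@dataclass
class EntityStore:
    """Hypothetical in-memory stand-in for the entity store."""
    schemas: Dict[str, dict] = field(default_factory=dict)


def describe_table(name: str,
                   store: EntityStore,
                   read_from_storage: Callable[[str], dict],
                   policy: SchemaPolicy) -> dict:
    """Return table metadata according to the selected policy."""
    if policy is SchemaPolicy.STRICT_CATALOG:
        # Fast path: never touches storage, so the result may be stale.
        return store.schemas[name]
    # Both refresh modes read the latest schema from the dataset.
    fresh = read_from_storage(name)
    if policy is SchemaPolicy.REFRESH_AND_COMMIT:
        # Persist the refreshed schema so later catalog reads see it.
        store.schemas[name] = fresh
    return fresh
```

With a store holding version 1 and storage reporting version 2, `refresh-read` would return version 2 while leaving the store at version 1, and `refresh-and-commit` would return version 2 and also update the store. Whether the request value should always win over the service default is itself a design choice that could be restricted by configuration.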

