gaborkaszab commented on PR #14137: URL: https://github.com/apache/iceberg/pull/14137#issuecomment-3318962619
Thanks for the PR @okumin ! Also, the sequence diagram in the linked Hive PR was very useful for understanding the use case.

If I'm not mistaken, the motivation for this is to introduce a server-side Table/TableMetadata cache within the HMS implementation of the REST catalog, and the original approach didn't work out because there is no catalog API to expose the metadata location without loading the whole table. Is my assumption correct?

As an initial step I'd recommend checking whether there is community support for such a broad change by asking on dev@. I think this is needed because this PR affects all the catalogs that load the table metadata from storage using a metadata location.

Also, some of my previous experience showed that metadata.json files can grow pretty big, and I'm wondering if there is any study on your side on what the configuration options are for the size of the cache and the maximum size of the metadata.json files. I'd be worried that in a real-world scenario only the small tables would fit into the cache anyway. You cache the parsed content of the metadata.json and not the compressed gzip version that is stored on storage, right?

======= Some thinking and side conversation =======

I was thinking about the architecture of the HMS-based REST catalog and maybe the root cause of these struggles to implement the server-side cache there. Let me know if I'm missing something. As I see it, the architecture for a load table is this:

1) Call the loadTable API on the HMS-based REST catalog
2) Internally call HiveCatalog's loadTable
3) This calls HMS's loadTable, which returns the metadata location
4) The internal HiveCatalog uses the metadata location from HMS to load the TableMetadata from storage
5) The internal HiveCatalog returns a Table object
6) A LoadTableResponse is constructed and returned from the HMS-based REST catalog

What I don't exactly see is why the internal HiveCatalog is needed other than as a convenience to connect to HMS (but we are already in HMS, right? Maybe a different process, though). Alternatively the sequence could be this, eliminating the need for an API to get the metadata location and also for the TableMetadataParser cache:

1) Call the loadTable API on the HMS-based REST catalog
2) Call HMS's loadTable directly to get the metadata location
3) Have a cache in the HMS-based REST catalog (storing metadata locations, ETags, Table objects, etc.). Check whether the table has changed using the cache
4 a) If the table has changed, do a full table load through HiveCatalog, or alternatively through TableMetadataParser using the metadata location
4 b) If the table hasn't changed, answer the request from the cache, or send a 304 Not Modified, depending on the use case

Would this make sense?
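To make steps 3) and 4) of the alternative sequence a bit more concrete, here is a minimal sketch of what a cache inside the HMS-based REST catalog could look like. It assumes Caffeine (which Iceberg already uses for CachingCatalog) and the existing TableMetadataParser.read(FileIO, String) API; the TableMetadataCache class, its load method and the "ETag equals metadata location" convention are hypothetical, just to illustrate the idea, not something from this PR:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import org.apache.iceberg.TableMetadata;
import org.apache.iceberg.TableMetadataParser;
import org.apache.iceberg.io.FileIO;

/**
 * Hypothetical server-side cache for the HMS-based REST catalog, keyed by
 * metadata location. A given metadata.json is immutable, so an entry never
 * needs explicit invalidation: a table change produces a new location and
 * therefore a new key, and stale entries simply age out.
 */
class TableMetadataCache {
  private final Cache<String, TableMetadata> cache;
  private final FileIO io;

  TableMetadataCache(FileIO io, long maxEntries) {
    this.io = io;
    this.cache = Caffeine.newBuilder()
        .maximumSize(maxEntries) // could also weigh entries by metadata size
        .build();
  }

  /**
   * Steps 3) and 4): the caller first asks HMS for the current metadata
   * location (cheap), then either answers from the cache or parses the
   * metadata.json from storage once and caches the result.
   */
  TableMetadata load(String currentMetadataLocation, String clientEtag) {
    if (currentMetadataLocation.equals(clientEtag)) {
      // Table unchanged from the client's point of view:
      // the REST layer could answer with 304 Not Modified here.
      return null;
    }
    return cache.get(
        currentMetadataLocation,
        location -> TableMetadataParser.read(io, location));
  }
}
```

The nice property of keying by metadata location is that it sidesteps cache invalidation entirely and avoids re-reading the metadata.json on every loadTable, without needing any new catalog API to expose the metadata location separately.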
