gaborkaszab commented on issue #13800: URL: https://github.com/apache/iceberg/issues/13800#issuecomment-3196775991
Thanks for raising this, @henrib! The algorithm you described above more or less matches the proposal I have for freshness-aware loading for the REST catalog. I also had it in mind to investigate whether the very same approach could be used for HiveCatalog too, so thanks again for bringing this up!

The REST-specific proposal deviates from yours in the following ways:

1) What to use to describe a particular state of a table
For REST we agreed to use ETags that are returned by the REST server and are opaque (it can't be assumed that an ETag is the current metadata location). Based on your description, for HiveCatalog you'd be more direct and explicitly use the metadata location as the 'identifier' of the state.

2) Level of responsibilities
What I mean here is that freshness-aware loading for the REST catalog is performed "seamlessly" and doesn't require any extra info to be exposed to the clients of the catalog, while your proposal would expose a metadata location and leave the rest of the work to the clients.

3) Number of catalog interactions
The REST catalog proposal makes a single call to the REST catalog regardless of whether the table has changed. With your proposal there is one call if the table hasn't changed and two if it has.

I also checked the referenced Hive change, and I have the impression that the only usage of this exposed metadata location is going to be from the Hive/HMS code and it won't be leveraged from the Iceberg code, right? We can't expect a follow-up change to implement freshness-aware loading in HiveCatalog, right?

It won't give you any short-term help, but let me just brainstorm here: since the REST catalog will have its own built-in, seamless freshness-aware loading implementation, I think we can gather the building blocks that would be required to have the same for HiveCatalog too. That way there won't be any need for clients like HMS, Hive, etc. to implement their own way of caching and refreshing.
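To make the difference in the number of catalog interactions concrete, here is a minimal sketch in Python. Everything in it (`FakeCatalog`, `ETagClient`, `LocationClient`, the method names, and the way the ETag is derived) is hypothetical and only illustrates the call counts of the two approaches; it is not the Iceberg, REST catalog, or HMS API.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class FakeCatalog:
    """Hypothetical in-memory stand-in for a catalog server that counts calls."""
    metadata_location: str = "s3://bucket/tbl/metadata/v1.json"
    calls: int = 0

    def commit(self, new_location):
        # Simulates a table change made by some writer.
        self.metadata_location = new_location

    def _etag(self):
        # Assumption: the server derives an opaque tag from its state.
        # Clients must not interpret it as a metadata location.
        return hashlib.sha256(self.metadata_location.encode()).hexdigest()

    def load_table(self, if_none_match=None):
        """Full table load; answers 304 when the client's ETag is current."""
        self.calls += 1
        etag = self._etag()
        if if_none_match == etag:
            return 304, etag, None
        return 200, etag, {"metadata-location": self.metadata_location}

    def get_metadata_location(self):
        """Lightweight call that only returns the current metadata location."""
        self.calls += 1
        return self.metadata_location


class ETagClient:
    """Conditional load: always exactly one catalog call per load()."""

    def __init__(self, catalog):
        self.catalog = catalog
        self.etag = None
        self.table = None

    def load(self):
        status, etag, table = self.catalog.load_table(if_none_match=self.etag)
        if status == 200:
            # Table changed (or first load): replace the cached copy.
            self.etag, self.table = etag, table
        return self.table


class LocationClient:
    """Location check first: one call if unchanged, two if changed."""

    def __init__(self, catalog):
        self.catalog = catalog
        self.location = None
        self.table = None

    def load(self):
        current = self.catalog.get_metadata_location()
        if current != self.location:
            _, _, self.table = self.catalog.load_table()
            # Record the location actually loaded, in case of a commit
            # between the two calls.
            self.location = self.table["metadata-location"]
        return self.table
```

In this sketch `ETagClient.load()` costs one catalog call whether or not the table changed, while `LocationClient.load()` costs one call when the cached location is still current and two after every commit, which is the case that matters for frequently changing tables.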
I'm not very familiar with the HMS API myself (will check later as homework), but I think the key would be to have a load table API (similar to REST) where HMS can add an ETag to the result, and the client can then attach this ETag to subsequent calls. This same API should also return a `304-Not Modified` kind of response to indicate to the client that the table hasn't changed since the ETag the client sent.

I'm pretty sure this would take some time to implement, release and have available in Iceberg, so I guess the question is whether HMS can in the meantime live with each loadTable request resulting in a table load (and not just a metadata location load). I guess there are already scenarios (e.g. streaming ingestion) where the proposed Hive/HMS approach would simply make things worse (by getting the location first and then loading the table second, because of the frequent changes).

Sorry for the long message. LMK WDYT!

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
