gaborkaszab commented on issue #13800:
URL: https://github.com/apache/iceberg/issues/13800#issuecomment-3196775991

   Thanks for raising this, @henrib! The algorithm you described above more or 
less resonates with the proposal I have for freshness-aware loading for the 
REST catalog. I also had it in mind to investigate whether the very same 
approach could be used for HiveCatalog too, so thanks again for bringing this up!
   
   The REST-specific proposal deviates from yours in the following ways:
   1) What to use to describe a particular state of a table
   For REST we agreed to use ETags, which are returned by the REST server and 
are opaque (a client can't assume an ETag is the current metadata location). 
Based on your description, for HiveCatalog you'd be more direct and explicitly 
use the metadata location as the 'identifier' for state.
   2) Level of responsibilities
   What I mean here is that the freshness-aware loading implementation for the 
REST catalog works "seamlessly" and doesn't require any extra info to be 
exposed to the clients of the catalog, while your proposal would expose 
a metadata location and leave the rest of the work to the clients.
   3) Number of catalog interactions
   The REST catalog proposal makes a single call to the REST catalog regardless 
of whether the table has changed. With your proposal there is one call if 
the table hasn't changed or two if it has.
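   
   To make the call counts for your proposal concrete, here is a minimal Python 
sketch of the two-step approach. `ToyCatalog`, `LocationCheckingClient` and 
their methods are hypothetical names made up for illustration; they are not 
HMS or Iceberg APIs, and the counter only models catalog round trips.

```python
class ToyCatalog:
    """Hypothetical in-memory catalog that counts round trips."""

    def __init__(self):
        self.metadata_location = "v1.metadata.json"
        self.calls = 0

    def get_metadata_location(self, table):
        # Cheap call: returns only the current metadata location.
        self.calls += 1
        return self.metadata_location

    def load_table(self, table):
        # Expensive call: returns the full table.
        self.calls += 1
        return {"name": table, "metadata_location": self.metadata_location}

    def commit(self, new_location):
        # Simulates a writer committing a new table version.
        self.metadata_location = new_location


class LocationCheckingClient:
    """Caches tables and compares metadata locations before reloading."""

    def __init__(self, catalog):
        self.catalog = catalog
        self.cache = {}

    def load(self, table):
        current = self.catalog.get_metadata_location(table)
        cached = self.cache.get(table)
        if cached is not None and cached["metadata_location"] == current:
            return cached  # unchanged: one catalog call in total
        loaded = self.catalog.load_table(table)  # changed: second call
        self.cache[table] = loaded
        return loaded


def calls_for_load(client, catalog, table):
    # Number of catalog calls a single client.load() triggers.
    before = catalog.calls
    client.load(table)
    return catalog.calls - before


catalog = ToyCatalog()
client = LocationCheckingClient(catalog)
cold = calls_for_load(client, catalog, "db.t")       # cold cache
unchanged = calls_for_load(client, catalog, "db.t")  # table not modified
catalog.commit("v2.metadata.json")
changed = calls_for_load(client, catalog, "db.t")    # table modified
```

   In this toy model an unchanged table costs one call and a changed (or not 
yet cached) table costs two.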
   
   I also checked the referenced Hive change, and I have the impression that 
the only usage of this exposed metadata location is going to be from the 
Hive/HMS code and it won't be leveraged from the Iceberg code, right? We can't 
expect a follow-up change to implement freshness-aware loading in HiveCatalog, 
right? It won't give you any short-term help, but let me just brainstorm here:
   Since the REST catalog will have its own built-in, seamless freshness-aware 
loading implementation, I think we can gather the building blocks it would 
take to have the same for HiveCatalog too. That way there won't be any need 
for clients like HMS, Hive etc. to implement their own way of caching and 
refreshing. I'm not very familiar with the HMS API myself (will check later on 
as homework), but I think the key would be to have a load table API (similar 
to REST) where HMS can add an ETag to the result, and the client can then 
attach this ETag to subsequent calls. This same API should also return a 
`304-Not Modified` kind of response to indicate to the client that the table 
hasn't changed since the ETag the client sent.
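
   As a rough sketch of that ETag handshake, assuming a load table call that 
accepts the previously returned ETag: everything below (`ToyEtagCatalog`, 
`FreshnessAwareClient`, `LoadResult`, the `if_none_match` parameter) is a 
hypothetical name for illustration, not an existing HMS or Iceberg API. The 
status values only mirror the HTTP 200 / 304 semantics.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoadResult:
    status: int            # 200 = table returned, 304 = not modified
    etag: str
    table: Optional[dict]  # None when status is 304


class ToyEtagCatalog:
    """Hypothetical catalog that tags each table state with an opaque ETag."""

    def __init__(self):
        self.calls = 0
        self._version = 1

    def _etag(self):
        # Opaque token: clients must not assume it is a metadata location.
        return hashlib.sha256(f"v{self._version}".encode()).hexdigest()[:16]

    def commit(self):
        # Simulates a writer committing a new table version.
        self._version += 1

    def load_table(self, name, if_none_match=None):
        self.calls += 1
        etag = self._etag()
        if if_none_match == etag:
            return LoadResult(304, etag, None)
        return LoadResult(200, etag, {"name": name, "version": self._version})


class FreshnessAwareClient:
    """Caches (etag, table) pairs and revalidates with a single call."""

    def __init__(self, catalog):
        self.catalog = catalog
        self._cache = {}

    def load(self, name):
        cached = self._cache.get(name)
        sent_etag = cached[0] if cached else None
        result = self.catalog.load_table(name, if_none_match=sent_etag)
        if result.status == 304:
            return cached[1]  # table unchanged: reuse the cached copy
        self._cache[name] = (result.etag, result.table)
        return result.table


catalog = ToyEtagCatalog()
client = FreshnessAwareClient(catalog)
t1 = client.load("db.t")  # full load, ETag remembered
t2 = client.load("db.t")  # 304 from the catalog, cached table reused
catalog.commit()
t3 = client.load("db.t")  # ETag no longer matches, full load again
```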
   Such an API would, I'm pretty sure, take some time to implement, release and 
make available in Iceberg, so I guess the question is whether HMS can in the 
meantime live with each loadTable request resulting in a table load (and not 
just a metadata location load). I guess there are already scenarios (e.g. 
streaming ingestion) where the proposed Hive/HMS approach would simply make 
things worse (because of the frequent changes, it would get the location first 
and then load the table second).
   
   Sorry for the long message. LMK WDYT!


