gaborkaszab commented on PR #16207: URL: https://github.com/apache/iceberg/pull/16207#issuecomment-4389536373
@grantatspothero I agree with 1). In fact I've done some research and Polaris for instance sends a reduced snapshotLog not just reduced list of snapshots in REFS mode. This ATM is a broken contract between Polaris and Iceberg IMO. You're right, if we don't include snapshotLog into the lazy loading mechanism then the internal state of TableMetadata will be broken after we lazily load the snapshots. The reason why we haven't caught this is that in the reference IRC we always populated the full snapshotLog even in REFS mode (see `suppressHistoricalSnapshots`), so no tests actually covered a reduced snapshotLog. cc @amogh-jahagirdar to double-check but I think we have an issue here between Polaris and Iceberg that this PR can fix. I disagree with 2) though, or I misunderstand the motivation :) Let me explain: That change of `suppressHistoricalSnapshots ` on the REST server side is for testing, to simulate the behavior of a potential REST server implementation (unless you use the reference IRC for your production REST catalog, that I wouldn't advise). So what you did with that change is you simulated a REST catalog server that reduces the snapshotLog not just the list of snapshots. This doesn't mean that all the REST server implementations will work like this once the PR is merged. There could be ones that still send a full snapshotLog. That's why I said that for REST servers that reduced the snapshotLog, your memory concerns are already gone (we'd still need the lazy loading, though), but for REST servers that send the full snapshotLog we still load the full log on the client side. This is what I tried to show with my commit, that if a server still sends the full list, then this PR doesn't solve the caching issue. I might overthink this. In case you have full control on the REST catalog server implementation (i.e. you have your own proprietary implementation) then you're fine here. In case you don't and you connect to any REST catalogs out there, then you have no control on whether they send filtered or full snapshotLog in REFS mode. Sorry if I got this too long or complicated, hope it makes sense! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
