pvary commented on PR #8555: URL: https://github.com/apache/iceberg/pull/8555#issuecomment-1720983421
> @pvary
>
> > My concern here is that we tie ourselves to a "random" point in time to refresh the metadata. In my opinion we will have very specific events where we need to refresh the metadata - namely, when we have a record with an unexpected schema. The proposed solution does not point in this direction.
>
> I wouldn't say it is random. In the committer it happens after a commit. In the writer, it happens when a writer is initialized.

I am fine with the commit - it was doing something similar already: we had to reload the table before every commit anyway, we just had to make sure that we do not have an unexpected refresh on the table. The writer did refresh from time to time in the middle of the code before your last change.

> I have a working prototype of refreshing the table using the delegation manager. The main complication is that the manager is initialized as part of the JM/TM initialization, not application initialization. This requires configuring the catalog for the delegation manager independently and introducing a separate configuration. It also requires the application to run in the same classloader as the JM or TM. If those can be addressed then it should be easy to plug in a table supplier that reads from that instead of the catalog.

I think the configuration handling could be solved - I have seen similar issues solved in production. There are 2 possibilities to solve the classloader issues:
- Put the provider/receiver/authentication into a plugin, since they use the same JVM. This way the plugin classloader will be used for all of them. I expect the main hurdle here is redirecting the authentication to the plugin
- Put everything on the main Flink classpath - this should be straightforward

> Which part of this do you feel is not going in the right direction? The main change is introducing an abstraction that allows a table to be refreshed, which is the first step to any solution. The actual reloading table supplier is a very small part of this.

This opens up a way to create very bad solutions (calling central components from Tasks - DDoS-ing catalogs), and it restricts future development: we only needed a schema refresh to follow the schema evolution of the tables, and that could be solved with schema store solutions. Also, a schema store could be used not only in the Iceberg sink, but in earlier steps of the jobs as well, so overall it seems a much better solution than refreshing the tables.

Thanks, Peter
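For context, a minimal sketch of the kind of "reloading table supplier" discussed above, assuming a catalog-backed reload on a fixed interval; the class and field names are hypothetical and this is not necessarily the abstraction introduced by the PR. It also makes the DDoS concern concrete: every reload is a catalog call, so if writer tasks use such a supplier, catalog traffic grows with the job parallelism.

```java
import java.io.Serializable;
import java.util.function.Supplier;

import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

/**
 * Illustrative only: supplies a Table and reloads it from the catalog at most
 * once per configured interval instead of on every access. The names and the
 * interval-based policy are assumptions for the sake of the example.
 */
class ReloadingTableSupplier implements Supplier<Table>, Serializable {

  private final Catalog catalog;            // assumed serializable or re-created per task
  private final TableIdentifier identifier;
  private final long reloadIntervalMs;

  private transient Table table;
  private transient long lastLoadTimeMs;

  ReloadingTableSupplier(Catalog catalog, TableIdentifier identifier, long reloadIntervalMs) {
    this.catalog = catalog;
    this.identifier = identifier;
    this.reloadIntervalMs = reloadIntervalMs;
  }

  @Override
  public Table get() {
    long now = System.currentTimeMillis();
    if (table == null || now - lastLoadTimeMs > reloadIntervalMs) {
      // Each reload is a round trip to the catalog; with one supplier per
      // writer task, the catalog sees load-table calls proportional to the
      // parallelism of the job.
      table = catalog.loadTable(identifier);
      lastLoadTimeMs = now;
    }
    return table;
  }
}
```

By contrast, the schema store approach mentioned above would distribute only schema changes to the operators, without each task going back to the catalog for full table metadata.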
