pvary commented on PR #8555: URL: https://github.com/apache/iceberg/pull/8555#issuecomment-1720983421
> @pvary
>
> > My concern here is that we tie ourselves to a "random" point in time to refresh the metadata. In my opinion we will have very specific events where we need to refresh the metadata - namely, when we have a record with an unexpected schema. The proposed solution does not point in this direction.
>
> I wouldn't say it is random. In the committer it happens after a commit. In the writer, it happens when a writer is initialized.

I am fine with the commit - it was doing something similar already: we had to reload the table before every commit anyway, we just had to make sure that we do not have an unexpected refresh on the table. The writer did refresh from time to time in the middle of the code before your last change.

> I have a working prototype of refreshing the table using the delegation manager. The main complication is that the manager is initialized as part of the JM/TM initialization, not application initialization. This requires configuring the catalog for the delegation manager independently and introducing a separate configuration. It also requires the application to run in the same classloader as the JM or TM. If those can be addressed then it should be easy to plug in a table supplier that reads from that instead of the catalog.

I think the configuration handling could be solved - I have seen similar issues solved in production. There are 2 possibilities to solve the classloader issues:
- Put the provider/receiver/authentication into a plugin, since they use the same JVM. This way the plugin classloader will be used for all of them. I expect the main hurdle here is redirecting the authentication to the plugin
- Put everything on the main Flink classpath - this should be straightforward

> Which part of this do you feel is not going in the right direction? The main change is introducing an abstraction that allows a table to be refreshed, which is the first step to any solution. The actual reloading table supplier is a very small part of this.

This opens up a way to create very bad solutions (calling central components from Tasks - DDoS-ing catalogs), and it restricts future development: we only needed a schema refresh to follow the schema evolution of the tables, and that could be solved with schema store solutions. Also, a schema store could be used not only in the Iceberg sink, but in earlier steps of the jobs as well, so overall it seems a much better solution than refreshing the tables.

Thanks, Peter
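For context, a minimal sketch of the kind of "reloading table supplier" discussed above, assuming a catalog-backed reload on a fixed interval; the class and field names are hypothetical and this is not necessarily the abstraction introduced by the PR. It also makes the DDoS concern concrete: every reload is a catalog call, so if writer tasks use such a supplier, catalog traffic grows with the job parallelism.

```java
import java.io.Serializable;
import java.util.function.Supplier;

import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

/**
 * Illustrative only: supplies a Table and reloads it from the catalog at most
 * once per configured interval instead of on every access. The names and the
 * interval-based policy are assumptions for the sake of the example.
 */
class ReloadingTableSupplier implements Supplier<Table>, Serializable {

  private final Catalog catalog;            // assumed serializable or re-created per task
  private final TableIdentifier identifier;
  private final long reloadIntervalMs;

  private transient Table table;
  private transient long lastLoadTimeMs;

  ReloadingTableSupplier(Catalog catalog, TableIdentifier identifier, long reloadIntervalMs) {
    this.catalog = catalog;
    this.identifier = identifier;
    this.reloadIntervalMs = reloadIntervalMs;
  }

  @Override
  public Table get() {
    long now = System.currentTimeMillis();
    if (table == null || now - lastLoadTimeMs > reloadIntervalMs) {
      // Each reload is a round trip to the catalog; with one supplier per
      // writer task, the catalog sees load-table calls proportional to the
      // parallelism of the job.
      table = catalog.loadTable(identifier);
      lastLoadTimeMs = now;
    }
    return table;
  }
}
```

By contrast, the schema store approach mentioned above would distribute only schema changes to the operators, without each task going back to the catalog for full table metadata.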
