JarroVGIT commented on issue #12263: URL: https://github.com/apache/iceberg/issues/12263#issuecomment-3697676433
@RussellSpitzer: I agree, tagging has its shortcomings here, and my post can be considered more of a thought experiment than a well thought-out proposal. It gets pretty far, but it breaks quickly in less simple scenarios.

With regard to what's missing in branching: I think it's primarily a matter of independent lifecycles. Tracking the schema id in the branch is definitely a step in the right direction, I think, but it would break the current feature (and, I assume, the intended purpose) of branching: data inserted in a branch is validated against the schema of the table, which ensures a possible fast-forward to the main branch later on.

But a branch is inherently always a child resource of a table, and not a descendant. This means there will always be contention: if I have 50 active branches (and yes, this is a real scenario[^1]), I suddenly have 50 writers trying to update the table metadata concurrently. Cloning would create a new table, but with the existing data files, so in this example all writers would write to their own metadata files with no contention.

Access control is another aspect, but as stated earlier by someone, that seems more like a catalog concern (and I even think it is already possible to implement).

Other examples where branches interfere with one another and lack an independent lifecycle are:

- `next-row-id` on the table is updated across branches.
- History within a branch is mixed with the table history: if a branch is updated, the newest table metadata shows the new snapshot-id for the branch, but you must apply a two-step lookup to determine the previous snapshot-id of that branch (namely: look up the snapshot-id prior to the current snapshot-id of the branch, load that metadata, and from there look up what the snapshot-id of the branch was at that point in time).
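To make the two-step lookup concrete, here is a minimal sketch of one way to read it. It assumes the field names from the Iceberg table spec (`refs`, `snapshot-id`, `metadata-log`, `metadata-file`); the in-memory `METADATA_FILES` dictionary, the file names, the snapshot ids, and the helpers `load_metadata` and `previous_branch_snapshot` are all hypothetical stand-ins, not an Iceberg API.

```python
from typing import Optional

# Hypothetical in-memory stand-ins for successive table metadata files.
# Field names follow the Iceberg table spec; all values are made up.
METADATA_FILES = {
    "v1.metadata.json": {
        "refs": {"main": {"snapshot-id": 100, "type": "branch"}},
        "metadata-log": [],
    },
    "v2.metadata.json": {
        "refs": {
            "main": {"snapshot-id": 100, "type": "branch"},
            "experiment": {"snapshot-id": 101, "type": "branch"},
        },
        "metadata-log": [
            {"timestamp-ms": 1, "metadata-file": "v1.metadata.json"},
        ],
    },
    "v3.metadata.json": {
        "refs": {
            "main": {"snapshot-id": 102, "type": "branch"},
            "experiment": {"snapshot-id": 101, "type": "branch"},
        },
        "metadata-log": [
            {"timestamp-ms": 1, "metadata-file": "v1.metadata.json"},
            {"timestamp-ms": 2, "metadata-file": "v2.metadata.json"},
        ],
    },
    "v4.metadata.json": {
        "refs": {
            "main": {"snapshot-id": 102, "type": "branch"},
            "experiment": {"snapshot-id": 103, "type": "branch"},
        },
        "metadata-log": [
            {"timestamp-ms": 1, "metadata-file": "v1.metadata.json"},
            {"timestamp-ms": 2, "metadata-file": "v2.metadata.json"},
            {"timestamp-ms": 3, "metadata-file": "v3.metadata.json"},
        ],
    },
}


def load_metadata(name: str) -> dict:
    """Hypothetical loader; a real one would read the file from storage."""
    return METADATA_FILES[name]


def previous_branch_snapshot(current_file: str, branch: str) -> Optional[int]:
    """Return the snapshot-id the branch pointed at before its current one."""
    current = load_metadata(current_file)
    current_id = current["refs"][branch]["snapshot-id"]
    # Walk older metadata files, newest first, and compare the branch ref.
    for entry in reversed(current["metadata-log"]):
        older = load_metadata(entry["metadata-file"])
        ref = older.get("refs", {}).get(branch)
        if ref is None:
            return None  # the branch did not exist yet at that point
        if ref["snapshot-id"] != current_id:
            return ref["snapshot-id"]
    return None


if __name__ == "__main__":
    print(previous_branch_snapshot("v4.metadata.json", "experiment"))
```

Note that the lookup for `experiment` has to step over a metadata file that only changed `main`, which is the mixing of branch and table history described in the second bullet.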
So, yeah, reading your first sentence again:

> Basically the introduction the catalog brings us back to the idea that we need to centralize the information about which tables own which snapshots in the same system.

I think you are correct; this is hardly possible without a centralised system that tracks this information across tables. Would this then be more suitable as an evolution of the REST spec, do you think?

[^1]: I have seen test suites that run several integration tests in parallel on clones of production tables, for example. Another example is a research department where clones are used for experimentation and dozens of people work on their own clones of the same production table.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
