JarroVGIT commented on issue #12263: URL: https://github.com/apache/iceberg/issues/12263#issuecomment-3697574473
Just to butt in, let me try to summarise:

- Zero-copy cloning creates a table with a lifecycle independent of that of the source table. This holds both at the catalog level (for example, access control) and at the storage level (e.g. when the source table is dropped, the clone should not get corrupted).
- Branching in Iceberg does **not** create a table with an independent lifecycle, but rather another version of the same logical table. As [per the documentation](https://iceberg.apache.org/docs/1.10.0/branching/#schema-selection-with-branches-and-tags), a pertinent difference is that when you write data to a branch of a table, the table-level schema must be used. Changing the schema of the table in the main branch automatically changes the schema of the branched version as well.

I can see a few issues with supporting this (although I would LOVE for this to become a feature):

1. Any maintenance activity on the source (or actually, also on the clone) might accidentally delete data files that are referenced by the other table. This is a tough problem, but I might have an idea on this. [1]
2. The clone is a completely new table, but its data files are in the base location of the source table. The new storage location should not interfere with the source table, so metadata must be written to a new location, but the manifests will point to a "foreign" location. I can see how this might cause issues in REST catalogs that do credential vending, as this is (afaik) based on the base path of a table.

[1]: I think the spec can be extended somewhat to support this using tags and additional table metadata. The key is to create a double reference (not sure if this is a **good** idea, but it certainly is **an** idea): the source table should somehow know that a clone was created from a certain snapshot-id, and the clone should know what the source table was. I imagine a workflow something like:

1. Generate a new table UUID for the clone.
2. Create a tag on the source, extended with a "cloned-to-ids" array of table references, including the cloned table's UUID.
3. Copy the relevant table metadata (not the whole history, independent lifecycle and all, just everything for the last snapshot-id) and store it in a new location.
4. Add some sort of identifier that this table is a clone, referencing both the table and the metadata file it was cloned from. I can think of two ways:
   - Either add a new field to the table metadata like "cloned-from-id" with the table UUID of the source table.
   - Or, add a tag with a "cloned-from-id" field.

Roughly, the two tables would then look like the sketch below.
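A minimal sketch of that double reference, written as Python dicts mirroring the table-metadata JSON. To be clear about what is made up: "cloned-to-ids", "cloned-from-id", "cloned-from-metadata", and all UUIDs, paths, and names below are illustrations of the proposal, not existing spec fields (only "type" and "snapshot-id" exist on a ref today):

```python
# On the SOURCE table, inside the "refs" map of its metadata: a tag that
# pins the cloned snapshot and records which clones still depend on it.
source_refs = {
    "clone-2025-01-15": {                   # made-up tag name
        "type": "tag",
        "snapshot-id": 3051729675574597004, # the snapshot that was cloned
        "cloned-to-ids": [                  # proposed field: clones that
            "a1b2c3d4-0000-0000-0000-000000000002",  # depend on this snapshot
        ],
    }
}

# On the CLONE: fresh UUID and base location, plus the proposed back
# reference. The copied manifests keep pointing at absolute file paths
# under the source's location.
clone_metadata = {
    "table-uuid": "a1b2c3d4-0000-0000-0000-000000000002",
    "location": "s3://warehouse/db/orders_clone",  # new, independent base path
    "cloned-from-id": "a1b2c3d4-0000-0000-0000-000000000001",  # source UUID
    "cloned-from-metadata":  # the source metadata file it was cloned from
        "s3://warehouse/db/orders/metadata/00042-a1b2c3d4.metadata.json",
    # plus the schemas/snapshot copied from the source as of that snapshot-id;
    # data and manifest paths still point under s3://warehouse/db/orders/
}
```

With that in place, snapshot expiration on the source can check "cloned-to-ids" before dropping the pinned snapshot, and dropping the clone knows exactly which tag on which source table to update.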
Now, this in combination with a few rules should get us a long way:

- Maintenance must never touch files outside the base path of a table, to prevent the dropping of a clone from resulting in deleted data files in the source table (see the guard sketch at the end of this comment).
- Snapshots referenced in tags with a non-empty "cloned-to-ids" array or a non-empty "cloned-from-id" field are never to be expired.
- Table drops (especially PURGE) should respect still-referenced snapshot-ids (not sure if this is feasible, to be honest; it seems very expensive to determine what is and what isn't safe to delete from storage).

Now, we can have independent lifecycles:

- I can clone a table and query it, without copying data files, delete files, puffin files, or even manifests and the manifest list.
- Doing things to the source table has no impact on the target table.
- Adding data to the target table has no impact on the source table.
- Dropping the source table should only delete non-clone-tagged data/delete/manifest files.
- Dropping the clone should remove the clone's UUID from the array on the tag in the source table's metadata, if that table still exists.
- If the snapshot-id (the one that was cloned) is expired in the clone, the clone should remove its UUID from the array on the tag, if the source table still exists.

The big issue I see with this is lingering files: right now a table can be dropped and purged because all the files are in the same base location and it's pretty much guaranteed that nobody depends on any of them. By preventing deletes through tags (if that is possible at all), it becomes possible that files are **never** deleted. For example:

1. Clone a table.
2. Drop and purge the source (the cloned snapshot-id is somehow retained).
3. Drop and purge the clone (the files in the new base location are deleted).

This would leave lingering files in the source table's location. It is also not possible to simply delete the "left over" files in the source location as part of a clone-purge, even if the "cloned-to-ids" array has only one element; there could be another (later) version of that table that was also cloned and relies on the same files as our original clone. Due to the immutability of files, this is inevitable, I think.

I tried to design this without a catalog, but the more I wrote this up, the more I became convinced that this is, in fact, something that would require a catalog. Catalogs are better positioned to keep track of all this, but the very next question we would get is probably "how do we clone cross-catalog?", which is (I think) impossible to codify currently. Or, maybe, somehow introduce a new file in a well-known location that lists snapshot-id -> clones mappings in ascending order, and that is never purged until the last cloned table removes its own (last) reference from it? Something like the sketch below.
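To make that last idea slightly more concrete, here is a toy sketch of such a well-known file and its bookkeeping. Everything here is made up for illustration: the registry path, the helper names, and the idea of storing it as plain JSON.

```python
# Hypothetical well-known location under the SOURCE table's base path.
CLONE_REGISTRY = "s3://warehouse/db/orders/metadata/clone-registry.json"

def register_clone(registry: dict, snapshot_id: int, clone_uuid: str) -> None:
    """Record that a clone was created from the given snapshot."""
    registry.setdefault(str(snapshot_id), []).append(clone_uuid)

def release_clone(registry: dict, snapshot_id: int, clone_uuid: str) -> None:
    """Called when a clone is dropped, or expires the snapshot it cloned."""
    refs = registry.get(str(snapshot_id), [])
    if clone_uuid in refs:
        refs.remove(clone_uuid)
    if not refs:  # last reference gone: the pinned snapshot may now expire
        registry.pop(str(snapshot_id), None)

def purge_is_safe(registry: dict) -> bool:
    # A full DROP ... PURGE of the source location is only safe once no
    # clone references any snapshot; until then the registry, and the
    # files its snapshots pin, must be left behind.
    return not registry
```

This doesn't solve the lingering-files problem, of course: the registry only tells you *that* files are still pinned, not when (or whether) anyone will ever come back to clean them up.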

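And for the first rule in the list above, a trivial illustration of the base-path guard a maintenance job (snapshot expiration, orphan-file removal) would need; the paths and the helper name are made up:

```python
# Only files under the table's own base location may be physically deleted,
# so files "borrowed" from a clone source are never touched. The trailing
# slash also stops a sibling prefix like .../orders matching .../orders_clone.
def deletable(table_location: str, candidates: list[str]) -> list[str]:
    base = table_location.rstrip("/") + "/"
    return [f for f in candidates if f.startswith(base)]

files = [
    "s3://warehouse/db/orders_clone/data/00000-new.parquet",  # written post-clone
    "s3://warehouse/db/orders/data/00000-orig.parquet",       # source's file
]
# Maintenance on the clone may physically delete only its own file:
assert deletable("s3://warehouse/db/orders_clone", files) == files[:1]
```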