JarroVGIT commented on issue #12263: URL: https://github.com/apache/iceberg/issues/12263#issuecomment-3697574473
Just to butt in, let me try to summarise:

- Zero-copy cloning creates a table with a lifecycle independent of that of the source table. This holds both at the catalog level (for example, access control) and at the storage level (e.g. when the source table is dropped, the clone should not get corrupted).
- Branching in Iceberg does **not** create a table with an independent lifecycle, but rather another version of the same logical table. As [per the documentation](https://iceberg.apache.org/docs/1.10.0/branching/#schema-selection-with-branches-and-tags), a pertinent difference is that when you write data to a branch of a table, the table-level schema must be used. Changing the schema of the table in the main branch automatically changes the schema of the branched version as well.

I can see a few issues with supporting this (although I would LOVE for this to become a feature):

1. Any maintenance activity on the source (or actually, also on the clone) might accidentally delete data files that are referenced by the other table. This is a tough problem, but I might have an idea on this. [1]
2. The clone is a completely new table, but its data files are in the base location of the source table. The new storage location should not interfere with the source table, so metadata must be written to a new location, but the manifests will point to a "foreign" location. I can see how this might cause issues in REST catalogs that do credential vending, as this is (afaik) based on the base path of a table.

[1]: I think the spec can be extended somewhat to support this using tags and additional table metadata. The key is to create a double reference (not sure if this is a **good** idea, but it certainly is **an** idea): the source table should somehow know that a clone was created from a certain snapshot-id, and the clone should know what the source table was. I imagine a workflow something like:

1. Generate a new table UUID for the clone.
2. Create a tag on the source, extended with a "cloned-to-ids" array of table references, including the cloned table's UUID.
3. Copy the relevant table metadata (not the whole history, independent lifecycle and all, just everything for the last snapshot-id) and store it in a new location.
4. Add some sort of identifier that this table is a clone, referencing both the table and the metadata file it was cloned from. I can think of two ways:
   - Either add a new field to the table metadata like "cloned-from-id" with the table UUID of the source table.
   - Or, add a tag with a "cloned-from-id" field.

Roughly, the two tables would then look like the sketch below.
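A minimal sketch of that double reference, written as Python dicts mirroring the table-metadata JSON. To be clear about what is made up: "cloned-to-ids", "cloned-from-id", "cloned-from-metadata", and all UUIDs, paths, and names below are illustrations of the proposal, not existing spec fields (only "type" and "snapshot-id" exist on a ref today):

```python
# On the SOURCE table, inside the "refs" map of its metadata: a tag that
# pins the cloned snapshot and records which clones still depend on it.
source_refs = {
    "clone-2025-01-15": {                   # made-up tag name
        "type": "tag",
        "snapshot-id": 3051729675574597004, # the snapshot that was cloned
        "cloned-to-ids": [                  # proposed field: clones that
            "a1b2c3d4-0000-0000-0000-000000000002",  # depend on this snapshot
        ],
    }
}

# On the CLONE: fresh UUID and base location, plus the proposed back
# reference. The copied manifests keep pointing at absolute file paths
# under the source's location.
clone_metadata = {
    "table-uuid": "a1b2c3d4-0000-0000-0000-000000000002",
    "location": "s3://warehouse/db/orders_clone",  # new, independent base path
    "cloned-from-id": "a1b2c3d4-0000-0000-0000-000000000001",  # source UUID
    "cloned-from-metadata":  # the source metadata file it was cloned from
        "s3://warehouse/db/orders/metadata/00042-a1b2c3d4.metadata.json",
    # plus the schemas/snapshot copied from the source as of that snapshot-id;
    # data and manifest paths still point under s3://warehouse/db/orders/
}
```

With that in place, snapshot expiration on the source can check "cloned-to-ids" before dropping the pinned snapshot, and dropping the clone knows exactly which tag on which source table to update.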
Now, this in combination with a few rules should get us a long way:

- Maintenance must never touch files outside the base path of a table, to prevent the dropping of a clone from resulting in deleted data files in the source table (see the guard sketch at the end of this comment).
- Snapshots referenced in tags with a non-empty "cloned-to-ids" array or a non-empty "cloned-from-id" field are never to be expired.
- Table drops (especially PURGE) should respect still-referenced snapshot-ids (not sure if this is feasible, to be honest; it seems very expensive to determine what is and what isn't safe to delete from storage).

Now, we can have independent lifecycles:

- I can clone a table and query it, without copying data files, delete files, puffin files, or even manifests and the manifest list.
- Doing things to the source table has no impact on the target table.
- Adding data to the target table has no impact on the source table.
- Dropping the source table should only delete non-clone-tagged data/delete/manifest files.
- Dropping the clone should remove the clone's UUID from the array on the tag in the source table's metadata, if that table still exists.
- If the snapshot-id (the one that was cloned) is expired in the clone, the clone should remove its UUID from the array on the tag, if the source table still exists.

The big issue I see with this is lingering files: right now a table can be dropped and purged because all the files are in the same base location and it's pretty much guaranteed that nobody depends on any of them. By preventing deletes through tags (if that is possible at all), it becomes possible that files are **never** deleted. For example:

1. Clone a table.
2. Drop and purge the source (the cloned snapshot-id is somehow retained).
3. Drop and purge the clone (the files in the new base location are deleted).

This would leave lingering files in the source table's location. It is also not possible to simply delete the "left over" files in the source location as part of a clone-purge, even if the "cloned-to-ids" array has only one element; there could be another (later) version of that table that was also cloned and relies on the same files as our original clone. Due to the immutability of files, this is inevitable, I think.

I tried to design this without a catalog, but the more I wrote this up, the more I became convinced that this is, in fact, something that would require a catalog. Catalogs are better positioned to keep track of all this, but the very next question we would get is probably "how do we clone cross-catalog?", which is (I think) impossible to codify currently. Or, maybe, somehow introduce a new file in a well-known location that lists snapshot-id -> clones mappings in ascending order, and that is never purged until the last cloned table removes its own (last) reference from it? Something like the sketch below.
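To make that last idea slightly more concrete, here is a toy sketch of such a well-known file and its bookkeeping. Everything here is made up for illustration: the registry path, the helper names, and the idea of storing it as plain JSON.

```python
# Hypothetical well-known location under the SOURCE table's base path.
CLONE_REGISTRY = "s3://warehouse/db/orders/metadata/clone-registry.json"

def register_clone(registry: dict, snapshot_id: int, clone_uuid: str) -> None:
    """Record that a clone was created from the given snapshot."""
    registry.setdefault(str(snapshot_id), []).append(clone_uuid)

def release_clone(registry: dict, snapshot_id: int, clone_uuid: str) -> None:
    """Called when a clone is dropped, or expires the snapshot it cloned."""
    refs = registry.get(str(snapshot_id), [])
    if clone_uuid in refs:
        refs.remove(clone_uuid)
    if not refs:  # last reference gone: the pinned snapshot may now expire
        registry.pop(str(snapshot_id), None)

def purge_is_safe(registry: dict) -> bool:
    # A full DROP ... PURGE of the source location is only safe once no
    # clone references any snapshot; until then the registry, and the
    # files its snapshots pin, must be left behind.
    return not registry
```

This doesn't solve the lingering-files problem, of course: the registry only tells you *that* files are still pinned, not when (or whether) anyone will ever come back to clean them up.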

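And for the first rule in the list above, a trivial illustration of the base-path guard a maintenance job (snapshot expiration, orphan-file removal) would need; the paths and the helper name are made up:

```python
# Only files under the table's own base location may be physically deleted,
# so files "borrowed" from a clone source are never touched. The trailing
# slash also stops a sibling prefix like .../orders matching .../orders_clone.
def deletable(table_location: str, candidates: list[str]) -> list[str]:
    base = table_location.rstrip("/") + "/"
    return [f for f in candidates if f.startswith(base)]

files = [
    "s3://warehouse/db/orders_clone/data/00000-new.parquet",  # written post-clone
    "s3://warehouse/db/orders/data/00000-orig.parquet",       # source's file
]
# Maintenance on the clone may physically delete only its own file:
assert deletable("s3://warehouse/db/orders_clone", files) == files[:1]
```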