Fokko opened a new issue, #700: URL: https://github.com/apache/iceberg-rust/issues/700
# Iceberg-rust write support

I've noticed a lot of interest in write support in Iceberg-rust. This issue aims to break the work down into smaller pieces so they can be picked up in parallel.

## Commit path

The commit path entails writing a new metadata JSON.

- [ ] **Applying updates to the metadata** [Updating the metadata](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/table/__init__.py#L706-L956) is important both for writing a new version of the JSON in the case of a non-REST catalog, and for keeping an up-to-date version in memory. It is recommended to re-use the [Updates](https://github.com/apache/iceberg/blob/866021d7d34f274349ce7de1f29d113395e7f28c/open-api/rest-catalog-open-api.yaml#L2557-L2575)/[Requirement](https://github.com/apache/iceberg/blob/866021d7d34f274349ce7de1f29d113395e7f28c/open-api/rest-catalog-open-api.yaml#L2588-L2605) objects provided by the [REST catalog protocol](https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml). PyIceberg uses a [similar](https://github.com/apache/iceberg-python/blob/b2f0a9e5cd7dd548e19cdcdd7f9205f03454369a/pyiceberg/table/update/__init__.py#L244) approach.
- [x] **REST catalog** [Serialize the updates and requirements](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/table/__init__.py#L1364-L1373) into JSON, which is [dispatched to the REST catalog](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/catalog/rest.py#L675-L706). Done in https://github.com/apache/iceberg-rust/pull/97.
- [ ] **Other catalogs** The other catalogs cannot simply dispatch the updates/requirements to the catalog; additional steps are needed:
  - [ ] Logic to [validate the requirements](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/catalog/glue.py#L453-L455) against the metadata, to detect commit conflicts.
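As an illustration of what requirement validation amounts to, here is a minimal sketch. The types (`Requirement`, `TableMetadata`, `check`) are hypothetical stand-ins, not the iceberg-rust API, which models these as the REST protocol's update/requirement objects:

```rust
// Hypothetical, simplified stand-ins for the REST protocol's requirement objects.
#[derive(Debug, PartialEq)]
enum Requirement {
    /// The table must not already exist (used for create-table commits).
    AssertCreate,
    /// The current snapshot id of `ref` must match (`None` = no snapshot yet).
    AssertRefSnapshotId { r#ref: String, snapshot_id: Option<i64> },
}

struct TableMetadata {
    current_snapshot_id: Option<i64>,
}

/// Validate a requirement against the metadata the catalog currently holds.
/// A mismatch means another writer committed in between: a commit conflict.
fn check(req: &Requirement, base: Option<&TableMetadata>) -> Result<(), String> {
    match (req, base) {
        (Requirement::AssertCreate, None) => Ok(()),
        (Requirement::AssertCreate, Some(_)) => Err("table already exists".into()),
        (Requirement::AssertRefSnapshotId { snapshot_id, .. }, Some(meta)) => {
            if meta.current_snapshot_id == *snapshot_id {
                Ok(())
            } else {
                Err(format!(
                    "commit conflict: expected snapshot {:?}, found {:?}",
                    snapshot_id, meta.current_snapshot_id
                ))
            }
        }
        (Requirement::AssertRefSnapshotId { .. }, None) => Err("table not found".into()),
    }
}
```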
    A lot of this logic is already being implemented by https://github.com/apache/iceberg-rust/pull/587.
  - [ ] Writing a new version of the [metadata.json](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/catalog/__init__.py#L775-L776) to the object store, taking into account the naming [as mentioned in the spec](https://github.com/apache/iceberg/blob/main/format/spec.md#metastore-tables).
  - [ ] Providing locking mechanisms within the commit ([Glue](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/catalog/glue.py#L476-L483), [Hive](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/catalog/hive.py#L379), [SQL](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/catalog/sql.py#L426), ..) so the atomic swap happens safely.
  - [ ] **SQL** It looks like conflict [detection is missing](https://github.com/apache/iceberg-rust/blob/50345196c87b00badc1a6490aef284e84f4c3e9a/crates/catalog/sql/src/catalog.rs#L475). I was expecting logic there to check whether any rows were affected (if not, another process has altered the table).
- [ ] **Update table properties** Sets [properties on the table](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/table/__init__.py#L326). Probably the best place to start, since it doesn't require a complicated API.
- [ ] **Schema evolution** [API](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/table/__init__.py#L1809) to update the schema and produce new metadata.
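On the metadata.json naming mentioned above, a minimal sketch of building the next metadata location; the function name and signature are hypothetical, assuming the zero-padded `<version>-<uuid>.metadata.json` convention that PyIceberg writes under the table's `metadata/` directory (the UUID is passed in to keep the sketch dependency-free):

```rust
/// Build the path for the next metadata file, e.g.
/// `<table>/metadata/00002-<uuid>.metadata.json`.
/// Hypothetical helper; `{version:05}` zero-pads the version to five digits.
fn new_metadata_location(table_location: &str, version: u64, uuid: &str) -> String {
    format!("{table_location}/metadata/{version:05}-{uuid}.metadata.json")
}
```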
  - [ ] A SchemaUpdate API to evolve the schema without the user having to worry about field-IDs: https://github.com/apache/iceberg-rust/issues/697
  - [ ] Add `unionByName` to union two schemas, providing easy schema evolution: https://github.com/apache/iceberg-rust/issues/698
- [ ] **Partition spec evolution** [API](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/table/__init__.py#L3003) to update the partition spec and produce new metadata.
- [ ] **Sort order evolution** API to update the sort order and produce new metadata.
- [ ] **Commit semantics**
  - [ ] **MergeAppend** Merges new entries into existing manifest files. Reduces the amount of metadata produced, but takes more time to commit since existing metadata has to be rewritten, and retries are also more costly.
  - [ ] **FastAppend** Generates a new manifest per commit, which allows fast commits but generates more metadata in the long run. PR by @ZENOTME in https://github.com/apache/iceberg-rust/pull/349.
- [ ] **Snapshot generation** Manipulation of data within a table is done by [appending snapshots](https://iceberg.apache.org/spec/#snapshots) to the metadata JSON.
  - [ ] **APPEND** Only data files were added and no files were removed.
  - [ ] **REPLACE** Data and delete files were added and removed without changing table data; i.e., compaction, changing the data file format, or relocating data files.
  - [ ] **OVERWRITE** Data and delete files were added and removed in a logical overwrite operation.
  - [ ] **DELETE** Data files were removed and their contents logically deleted, and/or delete files were added to delete rows.
- [ ] **Add files** To add existing Parquet files to a table. Issue in https://github.com/apache/iceberg-rust/issues/345.
  - [ ] [**Name mapping**](https://iceberg.apache.org/spec/#column-projection) In case the files don't have field-IDs set.
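The snapshot operations can be sketched as a classification step. Everything below is a hypothetical illustration; in particular, REPLACE vs. OVERWRITE depends on whether table contents logically changed, which file counts alone cannot determine, so that is taken as an explicit flag here:

```rust
/// The `operation` field of a snapshot summary.
#[derive(Debug, PartialEq)]
enum Operation { Append, Replace, Overwrite, Delete }

/// Hypothetical sketch: pick the summary `operation` from what a commit did.
/// `changes_table_data` is false for compaction/rewrites that leave the
/// logical table contents untouched.
fn operation(added_data: usize, removed: usize, added_deletes: usize,
             changes_table_data: bool) -> Operation {
    match (added_data, removed, added_deletes) {
        // Only data files added, nothing removed.
        (a, 0, 0) if a > 0 => Operation::Append,
        // Files shuffled around without changing table contents (compaction etc.).
        _ if !changes_table_data => Operation::Replace,
        // No new data, but files removed and/or delete files added.
        (0, r, d) if r > 0 || d > 0 => Operation::Delete,
        // Data added and removed in a logical overwrite.
        _ => Operation::Overwrite,
    }
}
```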
- [ ] **Summary generation** The part of the snapshot that indicates what's in the snapshot.
- [ ] **Metrics collection** There are two situations:
  - [ ] **Collect metrics while writing** As done with the Java API: during writing, the upper and lower bounds are tracked and the number of null and NaN records is counted.
  - [ ] **Collect metrics from the footer** When an existing file is added, the footer of the Parquet file is opened to reconstruct all the metrics needed for Iceberg.
- [ ] **Deletes** This mainly relies on strict projection to check whether data files cannot match the predicate.
  - [ ] **Strict projection** Needs to be added to the [transforms](https://github.com/apache/iceberg-python/pull/539).
  - [ ] **Strict Metrics Evaluator** To determine if the predicate [cannot match](https://github.com/apache/iceberg-python/pull/518).

## Metadata tables

Metadata tables are used to [inspect the table](https://iceberg.apache.org/docs/1.7.0/spark-ddl/). Having these tables also allows easy implementation of the [maintenance procedures](https://iceberg.apache.org/docs/1.7.0/spark-procedures/), since you can easily list all the snapshots and expire the ones that are older than a certain threshold.

## Contribute

If you want to contribute to the upcoming milestone, feel free to comment on this issue. If there is anything unclear or missing, feel free to reach out here as well 👍