Fokko opened a new issue, #700: URL: https://github.com/apache/iceberg-rust/issues/700
# Iceberg-rust write support

I've noticed a lot of interest in write support in Iceberg-rust. This issue aims to break the work down into smaller pieces so they can be picked up in parallel.

## Commit path

The commit path entails writing a new metadata JSON.

- [ ] **Applying updates to the metadata** [Updating the metadata](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/table/__init__.py#L706-L956) is important both for writing a new version of the JSON in the case of a non-REST catalog, and for keeping an up-to-date version in memory. It is recommended to re-use the [Updates](https://github.com/apache/iceberg/blob/866021d7d34f274349ce7de1f29d113395e7f28c/open-api/rest-catalog-open-api.yaml#L2557-L2575)/[Requirement](https://github.com/apache/iceberg/blob/866021d7d34f274349ce7de1f29d113395e7f28c/open-api/rest-catalog-open-api.yaml#L2588-L2605) objects provided by the [REST catalog protocol](https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml). PyIceberg uses a [similar](https://github.com/apache/iceberg-python/blob/b2f0a9e5cd7dd548e19cdcdd7f9205f03454369a/pyiceberg/table/update/__init__.py#L244) approach.
- [x] **REST catalog** [Serialize the updates and requirements](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/table/__init__.py#L1364-L1373) into JSON, which is [dispatched to the REST catalog](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/catalog/rest.py#L675-L706). Done in https://github.com/apache/iceberg-rust/pull/97.
- [ ] **Other catalogs** The other catalogs cannot simply dispatch the updates/requirements to the catalog; additional steps are needed:
  - [ ] Logic to [validate the requirements](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/catalog/glue.py#L453-L455) against the metadata, to detect commit conflicts.
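As an illustration of what requirement validation amounts to, here is a minimal sketch. The types (`Requirement`, `TableMetadata`, `check`) are hypothetical stand-ins, not the iceberg-rust API, which models these as the REST protocol's update/requirement objects:

```rust
// Hypothetical, simplified stand-ins for the REST protocol's requirement objects.
#[derive(Debug, PartialEq)]
enum Requirement {
    /// The table must not already exist (used for create-table commits).
    AssertCreate,
    /// The current snapshot id of `ref` must match (`None` = no snapshot yet).
    AssertRefSnapshotId { r#ref: String, snapshot_id: Option<i64> },
}

struct TableMetadata {
    current_snapshot_id: Option<i64>,
}

/// Validate a requirement against the metadata the catalog currently holds.
/// A mismatch means another writer committed in between: a commit conflict.
fn check(req: &Requirement, base: Option<&TableMetadata>) -> Result<(), String> {
    match (req, base) {
        (Requirement::AssertCreate, None) => Ok(()),
        (Requirement::AssertCreate, Some(_)) => Err("table already exists".into()),
        (Requirement::AssertRefSnapshotId { snapshot_id, .. }, Some(meta)) => {
            if meta.current_snapshot_id == *snapshot_id {
                Ok(())
            } else {
                Err(format!(
                    "commit conflict: expected snapshot {:?}, found {:?}",
                    snapshot_id, meta.current_snapshot_id
                ))
            }
        }
        (Requirement::AssertRefSnapshotId { .. }, None) => Err("table not found".into()),
    }
}
```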
    A lot of this logic is already being implemented by https://github.com/apache/iceberg-rust/pull/587.
  - [ ] Writing a new version of the [metadata.json](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/catalog/__init__.py#L775-L776) to the object store, taking into account the naming [as mentioned in the spec](https://github.com/apache/iceberg/blob/main/format/spec.md#metastore-tables).
  - [ ] Providing locking mechanisms within the commit ([Glue](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/catalog/glue.py#L476-L483), [Hive](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/catalog/hive.py#L379), [SQL](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/catalog/sql.py#L426), ..) so the atomic swap happens safely.
  - [ ] **SQL** It looks like conflict [detection is missing](https://github.com/apache/iceberg-rust/blob/50345196c87b00badc1a6490aef284e84f4c3e9a/crates/catalog/sql/src/catalog.rs#L475). I was expecting logic there to check whether any rows were affected (if not, another process has altered the table).
- [ ] **Update table properties** Sets [properties on the table](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/table/__init__.py#L326). Probably the best place to start, since it doesn't require a complicated API.
- [ ] **Schema evolution** [API](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/table/__init__.py#L1809) to update the schema and produce new metadata.
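On the metadata.json naming mentioned above, a minimal sketch of building the next metadata location; the function name and signature are hypothetical, assuming the zero-padded `<version>-<uuid>.metadata.json` convention that PyIceberg writes under the table's `metadata/` directory (the UUID is passed in to keep the sketch dependency-free):

```rust
/// Build the path for the next metadata file, e.g.
/// `<table>/metadata/00002-<uuid>.metadata.json`.
/// Hypothetical helper; `{version:05}` zero-pads the version to five digits.
fn new_metadata_location(table_location: &str, version: u64, uuid: &str) -> String {
    format!("{table_location}/metadata/{version:05}-{uuid}.metadata.json")
}
```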
  - [ ] A SchemaUpdate API to evolve the schema without the user having to worry about field-IDs: https://github.com/apache/iceberg-rust/issues/697
  - [ ] Add `unionByName` to union two schemas, providing easy schema evolution: https://github.com/apache/iceberg-rust/issues/698
- [ ] **Partition spec evolution** [API](https://github.com/apache/iceberg-python/blob/4b96d2f49b04ff7ec551646f489ecc50ac195b5d/pyiceberg/table/__init__.py#L3003) to update the partition spec and produce new metadata.
- [ ] **Sort order evolution** API to update the sort order and produce new metadata.
- [ ] **Commit semantics**
  - [ ] **MergeAppend** Merges new entries into existing manifest files. Reduces the amount of metadata produced, but takes more time to commit since existing metadata has to be rewritten, and retries are also more costly.
  - [ ] **FastAppend** Generates a new manifest per commit, which allows fast commits but generates more metadata in the long run. PR by @ZENOTME in https://github.com/apache/iceberg-rust/pull/349.
- [ ] **Snapshot generation** Manipulation of data within a table is done by [appending snapshots](https://iceberg.apache.org/spec/#snapshots) to the metadata JSON.
  - [ ] **APPEND** Only data files were added and no files were removed.
  - [ ] **REPLACE** Data and delete files were added and removed without changing table data; i.e., compaction, changing the data file format, or relocating data files.
  - [ ] **OVERWRITE** Data and delete files were added and removed in a logical overwrite operation.
  - [ ] **DELETE** Data files were removed and their contents logically deleted, and/or delete files were added to delete rows.
- [ ] **Add files** To add existing Parquet files to a table. Issue in https://github.com/apache/iceberg-rust/issues/345.
  - [ ] [**Name mapping**](https://iceberg.apache.org/spec/#column-projection) In case the files don't have field-IDs set.
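The snapshot operations can be sketched as a classification step. Everything below is a hypothetical illustration; in particular, REPLACE vs. OVERWRITE depends on whether table contents logically changed, which file counts alone cannot determine, so that is taken as an explicit flag here:

```rust
/// The `operation` field of a snapshot summary.
#[derive(Debug, PartialEq)]
enum Operation { Append, Replace, Overwrite, Delete }

/// Hypothetical sketch: pick the summary `operation` from what a commit did.
/// `changes_table_data` is false for compaction/rewrites that leave the
/// logical table contents untouched.
fn operation(added_data: usize, removed: usize, added_deletes: usize,
             changes_table_data: bool) -> Operation {
    match (added_data, removed, added_deletes) {
        // Only data files added, nothing removed.
        (a, 0, 0) if a > 0 => Operation::Append,
        // Files shuffled around without changing table contents (compaction etc.).
        _ if !changes_table_data => Operation::Replace,
        // No new data, but files removed and/or delete files added.
        (0, r, d) if r > 0 || d > 0 => Operation::Delete,
        // Data added and removed in a logical overwrite.
        _ => Operation::Overwrite,
    }
}
```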
- [ ] **Summary generation** The part of the snapshot that indicates what's in the snapshot.
- [ ] **Metrics collection** There are two situations:
  - [ ] **Collect metrics while writing** As done with the Java API: during writing, the upper and lower bounds are tracked and the number of null and NaN records is counted.
  - [ ] **Collect metrics from the footer** When an existing file is added, the footer of the Parquet file is opened to reconstruct all the metrics needed for Iceberg.
- [ ] **Deletes** This mainly relies on strict projection to check whether data files cannot match the predicate.
  - [ ] **Strict projection** Needs to be added to the [transforms](https://github.com/apache/iceberg-python/pull/539).
  - [ ] **Strict Metrics Evaluator** To determine if the predicate [cannot match](https://github.com/apache/iceberg-python/pull/518).

## Metadata tables

Metadata tables are used to [inspect the table](https://iceberg.apache.org/docs/1.7.0/spark-ddl/). Having these tables also allows easy implementation of the [maintenance procedures](https://iceberg.apache.org/docs/1.7.0/spark-procedures/), since you can easily list all the snapshots and expire the ones that are older than a certain threshold.

## Contribute

If you want to contribute to the upcoming milestone, feel free to comment on this issue. If there is anything unclear or missing, feel free to reach out here as well 👍