Fokko opened a new issue, #736:
URL: https://github.com/apache/iceberg-rust/issues/736

   The so-called fast-appends were added in 
https://github.com/apache/iceberg-rust/pull/349
   
   It would be good to also consider adding merge-commits.
   
   With a fast-append, a new manifest is written out and added to the 
manifest-list [as mentioned in the 
spec](https://iceberg.apache.org/spec/#snapshots). As the name suggests, this 
is the fastest way of appending new data, and it minimizes the chance of 
conflicts. It also behaves well when a commit has to be retried, since only the 
manifest-list has to be rewritten after a conflict. The biggest drawback is 
that it creates many manifests, which adds overhead in the long run (more calls 
to the object store than needed).
   
   The merge-commit takes an existing manifest, adds the new entries to it, and 
replaces the old manifest in the manifest-list. 
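
To make the difference concrete, here is a minimal sketch of the two strategies on a toy in-memory model. The `Manifest` and `ManifestList` types and both functions are hypothetical and are not the iceberg-rust API; real manifests are Avro files and carry far more metadata.

```rust
/// Toy stand-in for a manifest file: a path plus the data files it tracks.
/// (Hypothetical type, not part of iceberg-rust.)
#[derive(Debug, Clone, PartialEq)]
struct Manifest {
    path: String,
    data_files: Vec<String>,
}

/// Toy stand-in for a manifest-list: the manifests that make up one snapshot.
#[derive(Debug, Clone, PartialEq)]
struct ManifestList {
    manifests: Vec<Manifest>,
}

/// Fast-append: write one new manifest and add it to the list.
/// Existing manifests are left untouched, so the list grows by one.
fn fast_append(list: &ManifestList, new_path: &str, new_files: Vec<String>) -> ManifestList {
    let mut manifests = list.manifests.clone();
    manifests.push(Manifest {
        path: new_path.to_string(),
        data_files: new_files,
    });
    ManifestList { manifests }
}

/// Merge-append: rewrite the manifest at index `target` together with the new
/// entries and put the rewritten manifest in its place. The list keeps its
/// length. Returns `None` if `target` is out of range.
fn merge_append(
    list: &ManifestList,
    target: usize,
    new_path: &str,
    new_files: Vec<String>,
) -> Option<ManifestList> {
    let mut merged = list.manifests.get(target)?.data_files.clone();
    merged.extend(new_files);
    let mut manifests = list.manifests.clone();
    manifests[target] = Manifest {
        path: new_path.to_string(),
        data_files: merged,
    };
    Some(ManifestList { manifests })
}
```

Starting from a list with one manifest, `fast_append` yields two manifests, while `merge_append` with `target = 0` yields a single manifest that tracks both the old and the new data files.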
   
   Having too few manifests is not good because it limits 
parallelization, but having too many adds overhead in terms of networking and 
parsing. The thresholds are configurable and have 
reasonable defaults:
   
   
![image](https://github.com/user-attachments/assets/de52cdba-3e50-4bff-956a-bafc92ddfbfa)
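
As a rough illustration of how such thresholds could drive the merge decision, the sketch below packs manifests greedily up to a target size once a minimum manifest count is reached. The `MergeThresholds` struct, its field names, and `plan_merge` are assumptions made for this example (loosely modelled on a "minimum count to merge" and a "target manifest size" setting); they are not taken from iceberg-rust, and a real implementation may pack differently.

```rust
/// Hypothetical thresholds; the field names are illustrative only.
struct MergeThresholds {
    /// Do not merge at all while there are fewer manifests than this.
    min_count_to_merge: usize,
    /// Upper bound on the combined size of a merged manifest, in bytes.
    target_size_bytes: u64,
}

/// Plan which manifests to merge, given their sizes in bytes.
/// Returns groups of indices into `sizes`. A group with more than one index
/// would be rewritten as a single merged manifest; a group with one index
/// is left as is.
fn plan_merge(sizes: &[u64], thresholds: &MergeThresholds) -> Vec<Vec<usize>> {
    // Below the count threshold: every manifest stays on its own.
    if sizes.len() < thresholds.min_count_to_merge {
        return (0..sizes.len()).map(|i| vec![i]).collect();
    }

    let mut bins: Vec<Vec<usize>> = Vec::new();
    let mut current: Vec<usize> = Vec::new();
    let mut current_size: u64 = 0;

    for (i, &size) in sizes.iter().enumerate() {
        // Close the current bin if adding this manifest would exceed the target.
        if !current.is_empty()
            && current_size.saturating_add(size) > thresholds.target_size_bytes
        {
            bins.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current.push(i);
        current_size = current_size.saturating_add(size);
    }
    if !current.is_empty() {
        bins.push(current);
    }
    bins
}
```

A manifest that is already larger than the target ends up in a bin of its own, so it is never rewritten just to be merged with a neighbour.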
   
   The goal of this issue is to add `MergeAppendAction` next to 
`FastAppendAction`. This is not a trivial task since there are some caveats:
   
   - Each manifest is bound to a single partition spec: the 
partition-spec-id is stored in the [Avro 
header](https://github.com/apache/iceberg/blob/main/format/spec.md#manifests), 
so only manifests with the same partition-spec-id can be merged.
   - When rewriting the existing manifests, the `ADDED` status must be changed 
to `EXISTING`, and the sequence numbers must be tracked correctly.
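
The two caveats above can be sketched as follows. All types and function names here (`Status`, `ManifestEntry`, `ManifestFile`, `group_by_spec`, `carry_over`) are hypothetical simplifications, not the iceberg-rust types. The sketch assumes that an `ADDED` entry may leave its sequence number unset and inherit it from the manifest, which is why the number has to be written out explicitly once the entry becomes `EXISTING`. Handling of `DELETED` entries is out of scope and they are passed through unchanged.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Existing,
    Added,
    Deleted,
}

#[derive(Debug, Clone, PartialEq)]
struct ManifestEntry {
    status: Status,
    /// `None` means "inherit the sequence number from the manifest".
    sequence_number: Option<i64>,
    file_path: String,
}

#[derive(Debug, Clone, PartialEq)]
struct ManifestFile {
    partition_spec_id: i32,
    /// Sequence number of the commit that added this manifest.
    sequence_number: i64,
    entries: Vec<ManifestEntry>,
}

/// Group manifests by partition spec id; only manifests within the same
/// group are candidates for being merged into one manifest.
fn group_by_spec(manifests: Vec<ManifestFile>) -> BTreeMap<i32, Vec<ManifestFile>> {
    let mut groups: BTreeMap<i32, Vec<ManifestFile>> = BTreeMap::new();
    for manifest in manifests {
        groups
            .entry(manifest.partition_spec_id)
            .or_default()
            .push(manifest);
    }
    groups
}

/// Carry the entries of an existing manifest over into a rewritten one:
/// `Added` becomes `Existing`, and an inherited sequence number is made
/// explicit so it does not change when the entry moves to a new manifest.
fn carry_over(old: &ManifestFile) -> Vec<ManifestEntry> {
    old.entries
        .iter()
        .map(|entry| ManifestEntry {
            status: match entry.status {
                Status::Added => Status::Existing,
                other => other,
            },
            sequence_number: Some(entry.sequence_number.unwrap_or(old.sequence_number)),
            file_path: entry.file_path.clone(),
        })
        .collect()
}
```

A merged manifest would then be built per group by concatenating `carry_over` of each old manifest with the newly added entries.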

