anuragmantri opened a new issue, #15146: URL: https://github.com/apache/iceberg/issues/15146
### Proposed Change

# Proposal: Efficient column updates in Iceberg

As Iceberg increasingly supports AI and machine learning workloads, updating "wide tables" presents a significant efficiency challenge. Feature stores and vector databases often manage tables with thousands of columns, where updates frequently target only a small subset of features, such as refreshing embeddings, labels, or model scores.

Current Apache Iceberg primitives, Copy-on-Write and Merge-on-Read, operate at the row level. Updating even a single column therefore requires rewriting unrelated data, and the resulting write amplification affects both performance and operational costs.

## Use-cases

The row-granularity limitation becomes particularly problematic for ML/AI workloads:

- **Feature Backfilling & Column Updates**: A common workflow involves adding a new feature column (e.g., a model embedding) to a petabyte-scale table.
- **Model Score Updates**: Refreshing prediction scores after model retraining involves updating a subset of score columns in wide tables.
- **Embedding Refresh**: Updating vector embeddings in wide feature tables causes the entire row to be rewritten.
- **Incremental Feature Computation**: Daily batch jobs that compute and update 5-10 features out of 200 total features still rewrite entire rows, making daily updates cost-prohibitive at petabyte scale.

## Goals

- Reduce write amplification on column updates where all the rows need to be updated.
- Preserve read efficiency (column stats and pruning capabilities).
- Leverage the V4 architecture (build on the [Iceberg Single File Commits](https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw) and [Column Stats Improvements](https://docs.google.com/document/d/1uvbrwwAJW2TgsnoaIcwAFpjbhHkBUL5wY_24nKgtt9I/edit?tab=t.0#heading=h.hs6r9d26w1y2) proposals).

## Non-goals

- Partial updates, i.e. updates impacting only a subset of rows, are not covered in this design.
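To make the write amplification in the "Incremental Feature Computation" use-case concrete, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that every column takes an equal share of the table's bytes and that the table is 1 PB; the helper `bytes_written` is hypothetical and not part of any Iceberg API.

```python
def bytes_written(table_bytes: int, total_columns: int,
                  updated_columns: int, column_level: bool) -> int:
    """Estimate bytes written when some columns are updated for every row.

    Illustrative assumption: all columns are the same size on disk.
    """
    if not 0 < updated_columns <= total_columns:
        raise ValueError("updated_columns must be in 1..total_columns")
    if column_level:
        # Only the updated columns are written to new files.
        return table_bytes * updated_columns // total_columns
    # Row-level rewrite: every column of every row is written again.
    return table_bytes


PB = 10**15  # hypothetical table size
row_level = bytes_written(PB, total_columns=200, updated_columns=10, column_level=False)
col_level = bytes_written(PB, total_columns=200, updated_columns=10, column_level=True)
amplification = row_level / col_level
print(f"row-level: {row_level} B, column-level: {col_level} B, ratio: {amplification}x")
```

Under these assumptions, updating 10 of 200 columns at the row level writes about 20 times as many bytes as writing only the changed columns. Real ratios depend on column sizes, encodings, and compression.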
## Proposal

This proposal attempts to address the write amplification problem in Iceberg by introducing column-level updates. Engines write only the updated columns to separate column files and leave unchanged columns in the original base files. At read time, the column files are efficiently stitched together to materialize all the rows of the table.

### Proposal document

https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs

### Specifications

- [x] Table
- [ ] View
- [ ] REST
- [ ] Puffin
- [ ] Encryption
- [x] Other
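The read-time stitching idea described in the proposal can be sketched in a few lines. This is only a toy model, not the design from the proposal document: it represents each file as an in-memory mapping from column name to values, and it assumes that every column file covers all rows of the base file in the same row order (consistent with the non-goal of partial updates). The names `ColumnFile` and `stitch` are hypothetical.

```python
from typing import Any, Dict, List

# Toy stand-in for a data file: column name -> values in row order.
ColumnFile = Dict[str, List[Any]]


def stitch(base: ColumnFile, updates: List[ColumnFile]) -> List[Dict[str, Any]]:
    """Materialize rows by overlaying column files on a base file.

    Assumption: updates are ordered oldest to newest, and each one
    holds a value for every row of the base file (positional alignment).
    """
    if not base:
        raise ValueError("base file has no columns")
    row_count = len(next(iter(base.values())))
    merged = dict(base)  # shallow copy; the base mapping is left untouched
    for update in updates:
        for name, values in update.items():
            if len(values) != row_count:
                raise ValueError(f"column {name!r} does not cover all rows")
            merged[name] = values  # newer column replaces or adds to the base
    return [{name: col[i] for name, col in merged.items()} for i in range(row_count)]


base = {"id": [1, 2, 3], "score": [0.1, 0.2, 0.3]}
update = {"score": [0.9, 0.8, 0.7], "embedding": [[1.0], [2.0], [3.0]]}
rows = stitch(base, [update])
print(rows[0])
```

Here the `id` column is never rewritten: the update carries only the refreshed `score` column and the newly added `embedding` column, and the reader combines them by row position.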
