anuragmantri opened a new issue, #15146:
URL: https://github.com/apache/iceberg/issues/15146

   ### Proposed Change
   
   # Proposal: Efficient column updates in Iceberg
   
   As Iceberg increasingly supports AI and Machine Learning workloads, updating 
"wide tables" presents a significant efficiency challenge. Feature stores and 
vector databases often manage tables with thousands of columns, where updates 
frequently target only a small subset of features—such as refreshing 
embeddings, labels, or model scores. Current Apache Iceberg primitives, 
Copy-on-Write and Merge-on-Read, operate at the row level. This approach 
requires rewriting unrelated data during updates, resulting in write 
amplification that affects both performance and operational costs.
   
   
   ## Use-cases
   The row-granularity limitation becomes particularly problematic for ML/AI 
workloads:
   
   - **Feature Backfilling & Column Updates**: A common workflow involves 
adding a new feature column (e.g., a model embedding) to a petabyte-scale table.
   - **Model Score Updates**: Refreshing prediction scores after model 
retraining involves updating a subset of score columns in wide tables.
   - **Embedding Refresh**: Updating vector embeddings in wide feature tables 
causes the entire row to be rewritten.
   - **Incremental Feature Computation**: Daily batch jobs compute and 
update 5-10 features out of 200 total, yet every affected row is rewritten 
in full, making daily updates cost-prohibitive at petabyte scale.
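
As a rough illustration of the last use case, the sketch below estimates the write amplification of a row-level rewrite. It assumes every column takes about the same space on disk, which is a simplification (embedding columns are usually far larger than scalar features), and the function name `write_amplification` is hypothetical, not part of Iceberg.

```python
def write_amplification(total_columns: int, updated_columns: int) -> float:
    """Ratio of columns written by a full-row rewrite to columns that changed.

    Assumption: all columns are roughly equal in size, so column counts
    stand in for bytes written.
    """
    if not 0 < updated_columns <= total_columns:
        raise ValueError("updated_columns must be in the range 1..total_columns")
    return total_columns / updated_columns


# Updating 5-10 features out of 200, as in the use case above.
worst_case = write_amplification(200, 5)   # 40.0
best_case = write_amplification(200, 10)   # 20.0
```

Under that equal-size assumption, a row-level rewrite writes 20 to 40 times more data than the columns that actually changed.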
   
   ## Goals
   - Reduce write amplification on column updates where all the rows need to be 
updated.
   - Preserve read efficiency (column stats and pruning capabilities)
   - Leverage V4 architecture (build on [Iceberg Single File 
Commits](https://docs.google.com/document/d/1k4x8utgh41Sn1tr98eynDKCWq035SV_f75rtNHcerVw/edit?tab=t.0#heading=h.unn922df0zzw)
 and [Column Stats 
Improvements](https://docs.google.com/document/d/1uvbrwwAJW2TgsnoaIcwAFpjbhHkBUL5wY_24nKgtt9I/edit?tab=t.0#heading=h.hs6r9d26w1y2)
 proposals)
   
   ## Non-goals
   - Partial updates, i.e., updates impacting a subset of rows, are not 
covered in this design.
   
   ## Proposal 
   This proposal attempts to address the write amplification problem in Iceberg 
by introducing column-level updates. Engines write only the updated columns to 
separate column files and leave unchanged columns in the original base files. 
At read time, the column files are efficiently stitched with the base files to 
materialize all the rows of the table.
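
The read-time stitching idea can be sketched in a few lines. This is a minimal in-memory illustration, not the file format or reader described in the proposal document: the `ColumnFile` type, the `stitch` function, and the rule that later column files override earlier ones are all assumptions made for the example. It also assumes each column file covers every row of the base file, in line with the non-goal above.

```python
from typing import Any, Dict, List, Sequence

# Hypothetical stand-in for a data file: column name -> values in row order.
ColumnFile = Dict[str, List[Any]]


def stitch(base: ColumnFile, column_files: Sequence[ColumnFile]) -> List[Dict[str, Any]]:
    """Materialize rows by overlaying column files on a base file by row position."""
    columns = dict(base)  # shallow copy, so the base file is left untouched
    num_rows = len(next(iter(base.values()))) if base else 0
    for column_file in column_files:  # assumption: later files override earlier ones
        for name, values in column_file.items():
            if len(values) != num_rows:
                # Partial (subset-of-rows) updates are out of scope here.
                raise ValueError(f"column {name!r} does not cover all {num_rows} rows")
            columns[name] = values
    return [{name: values[i] for name, values in columns.items()} for i in range(num_rows)]


base_file = {"id": [1, 2, 3], "score": [0.1, 0.2, 0.3], "label": ["a", "b", "c"]}
score_update = {"score": [0.9, 0.8, 0.7]}  # only the refreshed column is written

rows = stitch(base_file, [score_update])
```

Here only the `score` column is written during the update, while `id` and `label` are still read from the base file.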
   
   ### Proposal document
   
   
https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs
   
   ### Specifications
   
   - [x] Table
   - [ ] View
   - [ ] REST
   - [ ] Puffin
   - [ ] Encryption
   - [x] Other


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

