zmk-wawa opened a new issue, #14627:
URL: https://github.com/apache/iceberg/issues/14627

   ### Feature Request / Improvement
   
   
https://github.com/apache/iceberg/blob/4ee507d5788e31c74d5ef77204ef126ae0105981/core/src/main/java/org/apache/iceberg/rest/CatalogHandlers.java#L453-L496
   
https://github.com/apache/iceberg/blob/4ee507d5788e31c74d5ef77204ef126ae0105981/core/src/main/java/org/apache/iceberg/rest/CatalogHandlers.java#L386-L409
   
   When updating Iceberg tables via REST, updateTable relies on optimistic 
concurrency control (OCC), which causes in-flight requests to fail when the 
table snapshot drifts from the request’s base snapshot. This behavior is overly 
strict for two common situations:
   1. APPEND operations: Concurrent appends that **do not change the table 
structure** are **semantically commutative**. However, the current commit path 
treats any snapshot drift as a hard conflict, leading to frequent retries and 
failures under high concurrency.
   2. Partition-scoped OVERWRITE operations: If concurrent **changes involve 
partitions that are disjoint from the overwrite’s target partitions**, the two 
operations are semantically independent. Nevertheless, the current 
snapshot-level check still rejects the overwrite.
   The end result in write-intensive, multi-writer deployments is excessive 
**retry traffic, high latency, and even task failures** after many retries.
   
   Expected changes:
   1. APPEND: If the snapshot drift since the request’s base contains only 
compatible changes (e.g., other appends; no structural metadata changes), the 
server should rebase the request onto the latest snapshot and commit, rather 
than failing immediately.
   2. OVERWRITE: If the drifted concurrent changes affect a set of partitions 
that are disjoint from the overwrite’s target partitions, the server should 
rebase onto the latest snapshot and commit (this should not be pushed back to 
the client to re-issue the update).
   These relaxations preserve correctness because they do not change the final 
table state relative to a serial ordering of the same operations.
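The two rules above can be sketched as a pair of predicates over the changes committed since the request's base snapshot. This is only an illustration under stated assumptions: `ChangeKind`, `Change`, and `RebaseCheck` are hypothetical names and not part of the Iceberg API, partitions are modeled as plain strings, and the append rule conservatively treats only other appends as compatible.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical model of a change committed after the request's base snapshot.
enum ChangeKind { APPEND, OVERWRITE, STRUCTURAL }

final class Change {
  final ChangeKind kind;
  final Set<String> partitions; // partitions touched; empty for STRUCTURAL changes

  Change(ChangeKind kind, Set<String> partitions) {
    this.kind = kind;
    this.partitions = partitions;
  }
}

final class RebaseCheck {
  private RebaseCheck() {}

  /** An APPEND may be rebased if the drift consists only of other appends (assumption). */
  static boolean canRebaseAppend(List<Change> drift) {
    return drift.stream().allMatch(c -> c.kind == ChangeKind.APPEND);
  }

  /**
   * An OVERWRITE may be rebased if the drift has no structural change and
   * touches no partition in the overwrite's target set.
   */
  static boolean canRebaseOverwrite(Set<String> targetPartitions, List<Change> drift) {
    for (Change c : drift) {
      if (c.kind == ChangeKind.STRUCTURAL) {
        return false;
      }
      if (!Collections.disjoint(c.partitions, targetPartitions)) {
        return false;
      }
    }
    return true;
  }
}
```

When either predicate returns false, the server would fall back to the existing behavior and report the conflict to the client.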
   
   Possible implementation:
   1. For APPEND requests: after detecting snapshot drift and confirming that 
it contains **no structural changes**, refresh to the latest snapshot and 
rebase the append onto it before committing.
   2. For OVERWRITE requests: first **determine the overwrite’s target 
partition set** (e.g., via its filter or known partition keys). Then check 
which partitions have been affected by drifted concurrent changes since the 
base snapshot. If these partition sets are disjoint, **refresh to the latest 
snapshot** and rebase the overwrite onto it before committing.
   3. Provide a mechanism that, under **high APPEND concurrency**, allows each 
table to use a short-lived server-side queue to **aggregate or serialize APPEND 
operations** to reduce retries. This should be opt-in, with moderate parameters 
(small maximum wait time / batch size) to preserve overall concurrency; 
parameters could even adapt based on historical load.
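As a rough illustration of the opt-in queue in item 3, the sketch below collects pending appends for one table into a batch bounded by a maximum size and a maximum wait time. `AppendBatcher`, `maxBatchSize`, and `maxWaitMillis` are hypothetical names chosen for this example, data files are represented as plain strings, and the actual commit of a batch is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical per-table queue that groups appends into small batches.
final class AppendBatcher {
  private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
  private final int maxBatchSize;
  private final long maxWaitMillis;

  AppendBatcher(int maxBatchSize, long maxWaitMillis) {
    this.maxBatchSize = maxBatchSize;
    this.maxWaitMillis = maxWaitMillis;
  }

  /** Enqueues one append (here just a data file path). */
  void submit(String dataFile) {
    pending.add(dataFile);
  }

  /** Returns up to maxBatchSize queued appends, waiting at most maxWaitMillis for more. */
  List<String> nextBatch() {
    List<String> batch = new ArrayList<>();
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMillis);
    try {
      while (batch.size() < maxBatchSize) {
        // Take whatever is already queued without blocking.
        pending.drainTo(batch, maxBatchSize - batch.size());
        if (batch.size() >= maxBatchSize) {
          break;
        }
        long remaining = deadline - System.nanoTime();
        if (remaining <= 0) {
          break;
        }
        String next = pending.poll(remaining, TimeUnit.NANOSECONDS);
        if (next == null) {
          break; // timed out waiting for more appends
        }
        batch.add(next);
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // keep the interrupt flag, return what we have
    }
    return batch;
  }
}
```

A single committer thread per table could call `nextBatch()` in a loop and commit each non-empty batch as one append, so that concurrent writers no longer race on the same base snapshot. Keeping both bounds small limits the latency added to any individual request.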
   
   ### Query engine
   
   Spark
   
   ### Willingness to contribute
   
   - [ ] I can contribute this improvement/feature independently
   - [x] I would be willing to contribute this improvement/feature with 
guidance from the Iceberg community
   - [ ] I cannot contribute this improvement/feature at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

