anvsk opened a new issue, #538:
URL: https://github.com/apache/iceberg-go/issues/538

   ### Apache Iceberg version
   
   main (development)
   
   ### Please describe the bug 🐞
   
   
   ## Environment
   
   * **iceberg-go**: v0.3.1
   * **Catalog**: REST catalog (standard Iceberg REST API)
   * **Go**: 1.21+
   * **OS**: Linux and Windows (reproducible on both)
   
   ## Summary
   
   In v0.3.0, `table.NewAddSchemaUpdate(*Schema, lastColumnID, initial)` 
allowed callers to ensure the table’s `last-assigned-field-id` stayed 
**monotonic** even when deleting the column that previously had the highest 
field ID.
   
   In **v0.3.1**, the API changed to `table.NewAddSchemaUpdate(*Schema)` (no 
`lastColumnID`). When we **only delete the highest-ID column(s)** and **add no 
new columns**, the library appears to derive `last-assigned-field-id` from the 
**new schema’s max field id**, which **decreases**. The REST catalog then 
rejects the commit with:
   
   ```
   invalid_metadata: The specified metadata is not valid
   ```
   
   Deleting a first/middle column works; deleting the **tail (max-ID) column** 
fails.
   
   > Note: This repro excludes partition/sort references (i.e., we are not 
deleting a column referenced by the default spec or sort order).
   
   ## Steps to Reproduce
   
   1. Start with a table whose current schema has fields, e.g.:
   
      * `a` (id=1), `b` (id=2), `c` (id=3).
        No partition/sort references to `c`.
   
   2. Build a **new schema** that **removes `c`** and keeps `a`/`b` with the 
**same field IDs** (we don’t touch IDs).
   
   3. Submit **two updates in one commit**:
   
      * `AddSchema` using `table.NewAddSchemaUpdate(newSchema)`
      * `SetCurrentSchema` using `table.NewSetCurrentSchemaUpdate(newSchemaID)`
        (We obtain `newSchemaID` by running `b := 
table.MetadataBuilderFromBase(meta); id, _ := b.AddSchema(newSchema)`.)
   
   4. Include concurrency requirements (optional but recommended):
   
      * `AssertTableUUID(meta.UUID())`
      * `AssertLastAssignedFieldID(oldLastID)` where `oldLastID` is **3** in 
this example.
   
   5. `CommitTable(...)` → **fails** with `invalid_metadata`.
   
   Minimal code sketch (v0.3.1 style):
   
   ```go
   meta := tbl.Metadata()
   oldLast := highestID(tbl.Schema()) // returns 3 in the example
   
   // Build new schema that keeps a(id=1), b(id=2) only (delete c(id=3))
   newSchema := buildSchemaKeepAB(tbl.Schema()) // preserves existing IDs
   
   // Precompute new schema-id
   b := table.MetadataBuilderFromBase(meta)
   newSchemaID, err := b.AddSchema(newSchema)
   if err != nil { panic(err) }
   
   // Prepare updates (v0.3.1 API)
   add := table.NewAddSchemaUpdate(newSchema)           // no lastColumnID parameter anymore
   set := table.NewSetCurrentSchemaUpdate(newSchemaID)
   
   reqs := []table.Requirement{
       table.AssertTableUUID(meta.UUID()),
       table.AssertLastAssignedFieldID(int(oldLast)),   // oldLast == 3
   }
   _, _, err = cat.CommitTable(ctx, tbl, reqs, []table.Update{add, set})
   // => invalid_metadata when only deleting tail/highest-ID columns
   ```
   
   ## Expected Behavior
   
   * Deleting columns (including the highest-ID column) should be allowed as 
long as:
   
     * We do not change existing field IDs of the kept columns.
     * `last-assigned-field-id` **does not decrease** (i.e., remains the 
previous value).
   * In v0.3.0, passing `lastColumnID=oldLast` ensured monotonicity and commits 
succeeded.
   
   ## Actual Behavior
   
   * With v0.3.1, `NewAddSchemaUpdate` cannot accept `lastColumnID`.
   * When we only delete the max-ID column and add no new columns, the commit 
is rejected with `invalid_metadata`, apparently because the derived 
`last-assigned-field-id` regresses to the new schema’s max ID.
   
   ## Analysis
   
   * Iceberg requires `last-assigned-field-id` to be **monotonic** (never 
decreases).
   * In the “delete-tail-columns only” scenario, the **current** 
`last-assigned-field-id` is the old max (e.g., 3). The **new schema’s** max 
becomes smaller (e.g., 2).
     If the client or server infers the counter from the new schema’s max, it 
violates monotonicity → `invalid_metadata`.
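   
   To make the regression concrete, here is a minimal standalone sketch in 
plain Go. It does not call iceberg-go; `maxFieldID`, `naiveLastID`, and 
`monotonicLastID` are hypothetical helper names, and the "naive" derivation is 
only our assumption about what happens, not confirmed library behavior.
   
   ```go
   package main
   
   import "fmt"
   
   // maxFieldID returns the largest field ID in ids, or 0 for an empty schema.
   func maxFieldID(ids []int) int {
       m := 0
       for _, id := range ids {
           if id > m {
               m = id
           }
       }
       return m
   }
   
   // naiveLastID derives the counter from the new schema alone
   // (ASSUMPTION: this mirrors the suspected behavior, it is not library code).
   func naiveLastID(newIDs []int) int {
       return maxFieldID(newIDs)
   }
   
   // monotonicLastID never lets the counter drop below its previous value.
   func monotonicLastID(oldLast int, newIDs []int) int {
       if m := maxFieldID(newIDs); m > oldLast {
           return m
       }
       return oldLast
   }
   
   func main() {
       oldLast := 3
       afterDrop := []int{1, 2} // a(id=1), b(id=2); c(id=3) was deleted
   
       fmt.Println(naiveLastID(afterDrop))              // 2: regresses below oldLast
       fmt.Println(monotonicLastID(oldLast, afterDrop)) // 3: counter preserved
   }
   ```
   
   Deleting a middle column instead (keeping IDs 1 and 3) gives 3 from both 
helpers, which matches the observation that only the tail-column case fails.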
   
   ## Workarounds
   
   * **Add a sentinel (dummy) column** in the same update with ID = `oldLast + 
1` (e.g., `__compat_padding_...`), nullable, never used. This keeps the new 
schema’s max ≥ old max.
     Or, more practically, add a real new column in the same change so max ID 
increases.
   * (Less ideal) Maintain a fork that restores the older API 
(`NewAddSchemaUpdate(schema, lastColumnID, initial)`) or custom-craft the REST 
payload to set `last-column-id = oldLast`.
   * In every case, make sure you are not deleting a field referenced by the 
partition spec or sort order (not the case in this repro).
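   
   The sentinel-column workaround can be sketched as below. This is an 
illustration under stated assumptions: `field` is a hypothetical stand-in 
struct rather than an iceberg-go type, `padIfRegressing` is a hypothetical 
helper, and the sentinel name is passed in by the caller.
   
   ```go
   package main
   
   import "fmt"
   
   // field is a hypothetical stand-in for a schema column (not an iceberg-go type).
   type field struct {
       ID       int
       Name     string
       Required bool
   }
   
   // padIfRegressing returns kept unchanged when its max field ID is already
   // >= oldLast. Otherwise it returns a copy with an optional (nullable)
   // sentinel column appended, using ID oldLast+1 and the given name.
   func padIfRegressing(kept []field, oldLast int, sentinelName string) []field {
       maxID := 0
       for _, f := range kept {
           if f.ID > maxID {
               maxID = f.ID
           }
       }
       if maxID >= oldLast {
           return kept
       }
       out := append([]field(nil), kept...) // copy so the input slice is not mutated
       return append(out, field{ID: oldLast + 1, Name: sentinelName, Required: false})
   }
   
   func main() {
       kept := []field{
           {ID: 1, Name: "a", Required: true},
           {ID: 2, Name: "b", Required: false},
       }
       // "__padding_example" is a made-up name used only for this sketch.
       padded := padIfRegressing(kept, 3, "__padding_example")
       fmt.Println(len(padded), padded[len(padded)-1].ID) // 3 4
   }
   ```
   
   The padded field list would then be turned into the real schema that is 
passed to `table.NewAddSchemaUpdate`; that conversion is omitted here because 
it depends on the column types in the table.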
   
   ## Proposal
   
   * **API / behavior options:**
   
     1. Re-introduce a way to set **`lastColumnID`** (or an equivalent 
parameter) on `AddSchema` in the Go client; or
     2. Have the client compute `last-assigned-field-id` as `max(oldLastID, 
max(newSchema.FieldIDs))` so it never regresses; or
     3. Provide a dedicated update or requirement to explicitly set/preserve 
`last-assigned-field-id` without requiring a dummy column.
   
   * **Docs**: Clarify in v0.3.1 migration notes that callers must ensure the 
counter doesn’t regress when deleting the highest-ID column, and suggest 
recommended patterns.
   
   ## Additional Context
   
   * The same flow succeeds if we delete a middle/first column (the new 
schema’s max ID stays the same).
   * The same flow succeeds if we add at least one new column (the new schema’s 
max ID increases).
   
   Happy to provide a tiny repro program if needed. Thanks!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

