emkornfield commented on code in PR #14004: URL: https://github.com/apache/iceberg/pull/14004#discussion_r2461705646
##########
format/spec.md:
##########
@@ -1859,7 +1859,28 @@ Java writes `-1` for "no current snapshot" with V1 and V2 tables and considers t
 ### Naming for GZIP compressed Metadata JSON files
-Some implementations require that GZIP compressed files have the suffix `.gz.metadata.json` to be read correctly. The Java reference implementation can additionally read GZIP compressed files with the suffix `metadata.json.gz`.
+Some implementations require that GZIP compressed files have the suffix `.gz.metadata.json` to be read correctly. The Java reference implementation can additionally read GZIP compressed files with the suffix `metadata.json.gz`.
+
+### Schema Evolution/Type Promotion
+
+Column projection rules are designed so that the table remains readable even if writers use an outdated schema. At the beginning of a transaction, writers should load the latest schema (the schema referenced by `current-schema-id` in the latest table metadata) and use it for reading and writing data. Note that in the common cases of schema evolution (adding nullable columns, adding required columns with an `initial-default`, renaming a column, dropping a column, or doing type promotion), appending data with outdated schemas presents no issues under either SNAPSHOT or SERIALIZABLE isolation levels.
+
+However, the less common case of updating default values may need to be handled depending on isolation level. Consider two concurrent transactions:
+
+* **T1** modifies the `write-default` on a column.
+* **T2** writes data that uses the `write-default` of the column changed by **T1**.
+
+If **T1** commits before **T2**, then handling **T2** depends on the isolation level.
+
+* **SNAPSHOT**: **T2** may be committed even though it used the old `write-default` (this is a permitted serialization anomaly).
+* **SERIALIZABLE**: **T2** must abort.
+
+When a transaction is aborted, it can be retried after updating to the new schema and rewriting the data using the new `write-default`. One way of ensuring SERIALIZABLE isolation is a two-phase approach:
+
+1. When committing, check whether there was a schema change (for the REST catalog this can be done with `assert-current-schema-id`).
+2. If the schema changed, determine whether there was a change to a `write-default` value used in the transaction (if there is no such column, the transaction may be retried without rewriting data).
+
+Writers must write out all fields with the types specified by the schema loaded at the beginning of the transaction. Writing all fields prevents issues similar to those outlined above, but with `initial-default` instead of `write-default` (a column of all nulls can't be distinguished from a missing column that would have its initial default substituted).

Review Comment:
Removed.
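
For readers following the thread, the two-phase check in the quoted spec text could be sketched roughly as below. This is only an illustration under stated assumptions, not Iceberg code: `Isolation`, `Outcome`, `resolve_commit`, and the dictionaries mapping field IDs to `write-default` values are hypothetical names invented for this example. Treating a dropped column as a changed default is also an assumption of the sketch.

```python
from enum import Enum


class Isolation(Enum):
    SNAPSHOT = "snapshot"
    SERIALIZABLE = "serializable"


class Outcome(Enum):
    COMMIT = "commit"
    RETRY_WITHOUT_REWRITE = "retry-without-rewrite"
    ABORT_AND_REWRITE = "abort-and-rewrite"


def resolve_commit(isolation, start_schema_id, current_schema_id,
                   start_write_defaults, current_write_defaults,
                   defaulted_field_ids):
    """Decide how a writer (T2) should proceed at commit time.

    All parameters are hypothetical inputs for this sketch:
    - start_schema_id / current_schema_id: schema IDs seen at the start of
      the transaction and at commit time.
    - start_write_defaults / current_write_defaults: dicts mapping field ID
      to write-default value under each schema.
    - defaulted_field_ids: field IDs for which the transaction actually
      wrote a write-default value.
    """
    # Phase 1: no schema change means nothing to reconcile.
    if start_schema_id == current_schema_id:
        return Outcome.COMMIT

    # SNAPSHOT isolation permits committing data written with the old default.
    if isolation is Isolation.SNAPSHOT:
        return Outcome.COMMIT

    # Phase 2 (SERIALIZABLE): rewrite only if a default that was actually
    # used has changed. A missing field compares as None (assumption).
    for field_id in defaulted_field_ids:
        old = start_write_defaults.get(field_id)
        new = current_write_defaults.get(field_id)
        if old != new:
            return Outcome.ABORT_AND_REWRITE

    # Schema changed, but no default used by this transaction was affected.
    return Outcome.RETRY_WITHOUT_REWRITE
```

With a REST catalog, phase 1 would presumably be carried out by the catalog itself through the `assert-current-schema-id` requirement rather than by a local comparison. The local check here only stands in for that step.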
