emkornfield commented on code in PR #14004: URL: https://github.com/apache/iceberg/pull/14004#discussion_r2334302838
##########
format/spec.md:
##########
@@ -1861,6 +1861,18 @@ Java writes `-1` for "no current snapshot" with V1 and V2 tables and considers t

 Some implementations require that GZIP compressed files have the suffix `.gz.metadata.json` to be read correctly. The Java reference implementation can additionally read GZIP compressed files with the suffix `metadata.json.gz`.

+### Schema Evolution/Type Promotion
+
+Column projection rules are designed so that the table will remain readable even if writers use an outdated schema. Writers should bind the latest schema at the beginning of a transaction. Note that in the common cases of schema evolution (adding nullable columns, adding required columns with an `initial-default`, renaming a column, dropping a column, or doing type promotion), appending data with outdated schemas presents no issues under either SNAPSHOT or SERIALIZABLE isolation levels.
+
+While writers are not required to bind to the latest schema, there are edge cases to consider:
+
+1. Assume two transactions that are started concurrently. The first modifies the `write-default` on a column. The second is a data write that makes use of the `write-default` from the column changed in the first transaction. If the first transaction gets committed first, the result of the second transaction depends on isolation level. Under SNAPSHOT isolation the second transaction can be committed. However, the second transaction produces the serialization anomaly of using the outdated `write-default` value. SERIALIZABLE isolation does not allow for such anomalies and the second transaction must fail in this mode. The transaction could be retried after updating to the new schema and rewriting the data using the new `write-default`.

Review Comment:
   Tried to add more details. Let me know if it makes sense as worded.

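To make case 1 concrete, here is a minimal Python sketch of a commit-time check. It is only an illustration under stated assumptions: `Schema`, `Isolation`, and `can_commit_append` are hypothetical names, not part of the Iceberg spec or any Iceberg API, and the model assumes the writer records which columns it filled from `write-default`.

```python
from dataclasses import dataclass, field
from enum import Enum


class Isolation(Enum):
    SNAPSHOT = "snapshot"
    SERIALIZABLE = "serializable"


@dataclass
class Schema:
    """Hypothetical, simplified schema: an id plus per-column write-defaults."""
    schema_id: int
    write_defaults: dict = field(default_factory=dict)  # column name -> write-default


def can_commit_append(bound, current, defaulted_columns, isolation):
    """Decide whether an append may commit.

    bound: the schema the writer bound when its transaction started.
    current: the table schema at commit time.
    defaulted_columns: columns the writer filled using bound.write_defaults.
    """
    # Columns whose write-default changed after the writer bound its schema.
    stale = {
        col
        for col in defaulted_columns
        if bound.write_defaults.get(col) != current.write_defaults.get(col)
    }
    if not stale:
        # No outdated default was used, so there is no anomaly at either level.
        return True
    # An outdated default was written. In this model SNAPSHOT tolerates the
    # anomaly, while SERIALIZABLE rejects the commit so the writer can retry.
    return isolation is Isolation.SNAPSHOT
```

For example, with `old = Schema(1, {"region": "us"})` and `new = Schema(2, {"region": "eu"})`, an append that defaulted `region` from `old` is accepted under `Isolation.SNAPSHOT` and rejected under `Isolation.SERIALIZABLE`; an append that defaulted nothing is accepted under both.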
##########
format/spec.md:
##########
@@ -1861,6 +1861,18 @@ Java writes `-1` for "no current snapshot" with V1 and V2 tables and considers t

 Some implementations require that GZIP compressed files have the suffix `.gz.metadata.json` to be read correctly. The Java reference implementation can additionally read GZIP compressed files with the suffix `metadata.json.gz`.

+### Schema Evolution/Type Promotion
+
+Column projection rules are designed so that the table will remain readable even if writers use an outdated schema. Writers should bind the latest schema at the beginning of a transaction. Note that in the common cases of schema evolution (adding nullable columns, adding required columns with an `initial-default`, renaming a column, dropping a column, or doing type promotion), appending data with outdated schemas presents no issues under either SNAPSHOT or SERIALIZABLE isolation levels.
+
+While writers are not required to bind to the latest schema, there are edge cases to consider:
+
+1. Assume two transactions that are started concurrently. The first modifies the `write-default` on a column. The second is a data write that makes use of the `write-default` from the column changed in the first transaction. If the first transaction gets committed first, the result of the second transaction depends on isolation level. Under SNAPSHOT isolation the second transaction can be committed. However, the second transaction produces the serialization anomaly of using the outdated `write-default` value. SERIALIZABLE isolation does not allow for such anomalies and the second transaction must fail in this mode. The transaction could be retried after updating to the new schema and rewriting the data using the new `write-default`.
+
+2. Assume a linear sequence of transactions: the first transaction adds a column and populates it with new values. The second transaction is run using the schema prior to the new column being added and updates another column (e.g. `update table x set pre_existing_col='xyz'`). The second transaction must fail under both SNAPSHOT and SERIALIZABLE isolation levels, since it would drop data from the new column added in the first transaction. If the transactions started concurrently, one of them should still fail with SNAPSHOT isolation because there is an overlap in the rows modified by the transactions.

Review Comment:
   fixed.
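
Case 2 can be sketched the same way. This is a minimal illustration, assuming columns are tracked by integer field ID; `missing_columns` and `can_commit_row_rewrite` are hypothetical helpers, not Iceberg APIs. The isolation level is deliberately not a parameter, because the text says the rewrite must fail under both levels.

```python
def missing_columns(bound_field_ids, current_field_ids):
    """Field IDs present in the current schema but absent from the writer's bound schema."""
    return sorted(set(current_field_ids) - set(bound_field_ids))


def can_commit_row_rewrite(bound_field_ids, current_field_ids):
    """Decide whether a rewrite of existing rows (e.g. an UPDATE) may commit.

    Hypothetical rule for this sketch: rewritten rows are re-emitted using the
    bound schema, so any column the bound schema lacks would lose its stored
    values. The rewrite is rejected whenever such a column exists.
    """
    return not missing_columns(bound_field_ids, current_field_ids)
```

With a bound schema of fields `{1, 2}` and a current schema of `{1, 2, 3}`, `missing_columns` reports `[3]` and the rewrite is rejected. This check only concerns rewrites of existing rows; per the text above, plain appends with an outdated schema remain acceptable in the common evolution cases.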
