rdblue commented on code in PR #10955:
URL: https://github.com/apache/iceberg/pull/10955#discussion_r1725963520
##########
format/spec.md:
##########
@@ -222,11 +228,41 @@ Schemas may be evolved by type promotion or adding,
deleting, renaming, or reord
Evolution applies changes to the table's current schema to produce a new
schema that is identified by a unique schema ID, is added to the table's list
of schemas, and is set as the table's current schema.
-Valid type promotions are:
+Valid primitive type promotions are:
+
+| Primitive type | v1, v2 valid type promotions | v3+ valid type promotions
| Requirements |
+|------------------|------------------------------|--------------------------------------------|--------------|
+| `unknown` | | _any type_
| |
+| `int` | `long` | `long`, `string`
| |
+| `long` | | `string`
| |
+| `date` | | `timestamp`,
`timestamp_ns` | Promotion to `timestamptz` or `timestamptz_ns`
is **not** allowed |
+| `float` | `double` | `double`
| |
+| `decimal(P, S)` | `decimal(P', S)` if `P' > P` | `decimal(P', S)` if `P' >
P` | Widen precision only |
+| _all types_ | | `variant` if Variant spec
allows the type | Conversion must produce a Variant containing a single value
of the original type |
+
+Iceberg's Avro manifest format does not store the type of lower and upper
bounds, and type promotion does not rewrite existing bounds. For example, when
a `float` is promoted to `double`, existing data file bounds are encoded as 4
little-endian bytes rather than 8 little-endian bytes for `double`. To
correctly decode the value, the original type at the time the file was written
must be inferred according to the following table:
+
+| Current type | Length of bounds | Inferred type at write time | Special
handling |
+|------------------|------------------|-----------------------------|------------------|
+| `long` | 4 bytes | `int` | |
+| `long` | 8 bytes | `long` | |
+| `double` | 4 bytes | `float` | |
+| `double` | 8 bytes | `double` | |
+| `timestamp` | 4 bytes | `date` | |
+| `timestamp` | 8 bytes | `timestamp` | |
+| `timestamp_ns` | 4 bytes | `date` | |
+| `timestamp_ns` | 8 bytes | `timestamp_ns` | |
+| `string` | 4 bytes | `int` or `string` | Discard
and ignore bounds [1] |
+| `string` | 8 bytes | `long` or `string` | Discard
and ignore bounds [1] |
+| `string` | not 4 or 8 bytes | `string` | Use
`string` if either bound is not 4 or 8 bytes |
+| `decimal(P, S)` | _any_ | `decimal(P', S)`; `P' <= P` | |
+| `variant` | _any_ | Any type | Discard
and ignore bounds |
-* `int` to `long`
-* `float` to `double`
-* `decimal(P, S)` to `decimal(P', S)` if `P' > P` -- widen the precision of
decimal types.
+Notes:
+1. Allowed promotion from `long` to `timestamptz` makes 8-byte lower and upper
bounds ambiguous becuase `long` values are in milliseconds and `timestamptz`
values are in milliseconds.
Review Comment:
Yeah, this was explaining why stats had to be ignored for `long` to
`timestamptz` that I originally had in the table above. However, because of the
ambiguity here I removed the type promotion from this PR. I think we can
support it in v4 cleanly.
You're right about the typo. The problem is that you can't distinguish
between a milliseconds value (original type `long`) and a microseconds value
(original type `timestamptz`).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]