RussellSpitzer commented on code in PR #10981: URL: https://github.com/apache/iceberg/pull/10981#discussion_r1767640504
########## format/spec.md: ########## @@ -1000,59 +1010,66 @@ Values should be stored in Parquet using the types and logical type annotations Lists must use the [3-level representation](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists). -| Type | Parquet physical type | Logical type | Notes | -|--------------------|--------------------------------------------------------------------|---------------------------------------------|----------------------------------------------------------------| -| **`boolean`** | `boolean` | | | -| **`int`** | `int` | | | -| **`long`** | `long` | | | -| **`float`** | `float` | | | -| **`double`** | `double` | | | -| **`decimal(P,S)`** | `P <= 9`: `int32`,<br />`P <= 18`: `int64`,<br />`fixed` otherwise | `DECIMAL(P,S)` | Fixed must use the minimum number of bytes that can store `P`. | -| **`date`** | `int32` | `DATE` | Stores days from 1970-01-01. | -| **`time`** | `int64` | `TIME_MICROS` with `adjustToUtc=false` | Stores microseconds from midnight. | -| **`timestamp`** | `int64` | `TIMESTAMP_MICROS` with `adjustToUtc=false` | Stores microseconds from 1970-01-01 00:00:00.000000. | -| **`timestamptz`** | `int64` | `TIMESTAMP_MICROS` with `adjustToUtc=true` | Stores microseconds from 1970-01-01 00:00:00.000000 UTC. | -| **`timestamp_ns`** | `int64` | `TIMESTAMP_NANOS` with `adjustToUtc=false` | Stores nanoseconds from 1970-01-01 00:00:00.000000000. | -| **`timestamptz_ns`** | `int64` | `TIMESTAMP_NANOS` with `adjustToUtc=true` | Stores nanoseconds from 1970-01-01 00:00:00.000000000 UTC. | -| **`string`** | `binary` | `UTF8` | Encoding must be UTF-8. | -| **`uuid`** | `fixed_len_byte_array[16]` | `UUID` | | -| **`fixed(L)`** | `fixed_len_byte_array[L]` | | | -| **`binary`** | `binary` | | | -| **`struct`** | `group` | | | -| **`list`** | `3-level list` | `LIST` | See Parquet docs for 3-level representation. | -| **`map`** | `3-level map` | `MAP` | See Parquet docs for 3-level representation. | +| Type | Parquet physical type | Logical type | Notes | +|--------------------|--------------------------------------------------------------------|---------------------------------------------|---------------------------------------------------------------------------------------------------------| +| **`boolean`** | `boolean` | | | +| **`int`** | `int` | | | +| **`long`** | `long` | | | +| **`float`** | `float` | | | +| **`double`** | `double` | | | +| **`decimal(P,S)`** | `P <= 9`: `int32`,<br />`P <= 18`: `int64`,<br />`fixed` otherwise | `DECIMAL(P,S)` | Fixed must use the minimum number of bytes that can store `P`. | +| **`date`** | `int32` | `DATE` | Stores days from 1970-01-01. | +| **`time`** | `int64` | `TIME_MICROS` with `adjustToUtc=false` | Stores microseconds from midnight. | +| **`timestamp`** | `int64` | `TIMESTAMP_MICROS` with `adjustToUtc=false` | Stores microseconds from 1970-01-01 00:00:00.000000. | +| **`timestamptz`** | `int64` | `TIMESTAMP_MICROS` with `adjustToUtc=true` | Stores microseconds from 1970-01-01 00:00:00.000000 UTC. | +| **`timestamp_ns`** | `int64` | `TIMESTAMP_NANOS` with `adjustToUtc=false` | Stores nanoseconds from 1970-01-01 00:00:00.000000000. | +| **`timestamptz_ns`** | `int64` | `TIMESTAMP_NANOS` with `adjustToUtc=true` | Stores nanoseconds from 1970-01-01 00:00:00.000000000 UTC. | +| **`string`** | `binary` | `UTF8` | Encoding must be UTF-8. | +| **`uuid`** | `fixed_len_byte_array[16]` | `UUID` | | +| **`fixed(L)`** | `fixed_len_byte_array[L]` | | | +| **`binary`** | `binary` | | | +| **`struct`** | `group` | | | +| **`list`** | `3-level list` | `LIST` | See Parquet docs for 3-level representation. | +| **`map`** | `3-level map` | `MAP` | See Parquet docs for 3-level representation. | +| **`geometry`** | `binary` | `GEOMETRY` | WKB format, see Appendix G. Logical type annotation optional for supported Parquet format versions [1]. | + +Notes: +1. [https://github.com/apache/parquet-format/pull/240](https://github.com/apache/parquet-format/pull/240)) ### ORC **Data Type Mappings** -| Type | ORC type | ORC type attributes | Notes | -|--------------------|---------------------|------------------------------------------------------|-----------------------------------------------------------------------------------------| -| **`boolean`** | `boolean` | | | -| **`int`** | `int` | | ORC `tinyint` and `smallint` would also map to **`int`**. | -| **`long`** | `long` | | | -| **`float`** | `float` | | | -| **`double`** | `double` | | | -| **`decimal(P,S)`** | `decimal` | | | -| **`date`** | `date` | | | -| **`time`** | `long` | `iceberg.long-type`=`TIME` | Stores microseconds from midnight. | -| **`timestamp`** | `timestamp` | `iceberg.timestamp-unit`=`MICROS` | Stores microseconds from 2015-01-01 00:00:00.000000. [1], [2] | -| **`timestamptz`** | `timestamp_instant` | `iceberg.timestamp-unit`=`MICROS` | Stores microseconds from 2015-01-01 00:00:00.000000 UTC. [1], [2] | -| **`timestamp_ns`** | `timestamp` | `iceberg.timestamp-unit`=`NANOS` | Stores nanoseconds from 2015-01-01 00:00:00.000000000. [1] | -| **`timestamptz_ns`** | `timestamp_instant` | `iceberg.timestamp-unit`=`NANOS` | Stores nanoseconds from 2015-01-01 00:00:00.000000000 UTC. [1] | -| **`string`** | `string` | | ORC `varchar` and `char` would also map to **`string`**. | -| **`uuid`** | `binary` | `iceberg.binary-type`=`UUID` | | -| **`fixed(L)`** | `binary` | `iceberg.binary-type`=`FIXED` & `iceberg.length`=`L` | The length would not be checked by the ORC reader and should be checked by the adapter. | -| **`binary`** | `binary` | | | -| **`struct`** | `struct` | | | -| **`list`** | `array` | | | -| **`map`** | `map` | | | + +| Type | ORC type | ORC type attributes | Notes | +|--------------------|---------------------|------------------------------------------------------|------------------------------------------------------------------------------------------| +| **`boolean`** | `boolean` | | | +| **`int`** | `int` | | ORC `tinyint` and `smallint` would also map to **`int`**. | +| **`long`** | `long` | | | +| **`float`** | `float` | | | +| **`double`** | `double` | | | +| **`decimal(P,S)`** | `decimal` | | | +| **`date`** | `date` | | | +| **`time`** | `long` | `iceberg.long-type`=`TIME` | Stores microseconds from midnight. | +| **`timestamp`** | `timestamp` | `iceberg.timestamp-unit`=`MICROS` | Stores microseconds from 2015-01-01 00:00:00.000000. [1], [2] | +| **`timestamptz`** | `timestamp_instant` | `iceberg.timestamp-unit`=`MICROS` | Stores microseconds from 2015-01-01 00:00:00.000000 UTC. [1], [2] | +| **`timestamp_ns`** | `timestamp` | `iceberg.timestamp-unit`=`NANOS` | Stores nanoseconds from 2015-01-01 00:00:00.000000000. [1] | +| **`timestamptz_ns`** | `timestamp_instant` | `iceberg.timestamp-unit`=`NANOS` | Stores nanoseconds from 2015-01-01 00:00:00.000000000 UTC. [1] | +| **`string`** | `string` | | ORC `varchar` and `char` would also map to **`string`**. | +| **`uuid`** | `binary` | `iceberg.binary-type`=`UUID` | | +| **`fixed(L)`** | `binary` | `iceberg.binary-type`=`FIXED` & `iceberg.length`=`L` | The length would not be checked by the ORC reader and should be checked by the adapter. | +| **`binary`** | `binary` | | | +| **`struct`** | `struct` | | | +| **`list`** | `array` | | | +| **`map`** | `map` | | | +| **`geometry`** | `binary` | | WKB format, see Appendix G; Optional geometry type for supported ORC format versions [3] | Review Comment: ```[Appendix G](#appendix-g)``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org