nssalian commented on code in PR #3098:
URL: https://github.com/apache/iceberg-python/pull/3098#discussion_r2891173341
##########
mkdocs/docs/api.md:
##########
@@ -2039,3 +2039,82 @@ DataFrame()
 | 3 | 6 |
 +---+---+
 ```
+
+## Type mapping
+
+### PyArrow
+
+The Iceberg specification only specifies type mapping for Avro, Parquet, and ORC:
+
+- [Iceberg to Avro](https://iceberg.apache.org/spec/#avro)
+
+- [Iceberg to Parquet](https://iceberg.apache.org/spec/#parquet)
+
+- [Iceberg to ORC](https://iceberg.apache.org/spec/#orc)
+
+The following tables describe the type mappings between PyIceberg and PyArrow. In the tables below, `pa` refers to the `pyarrow` module:
+
+```python
+import pyarrow as pa
+```
+
+#### PyIceberg to PyArrow type mapping
+
+| PyIceberg type class | PyArrow type | Notes |
+|---------------------------------|-------------------------------------|----------------------------------------|
+| `BooleanType` | `pa.bool_()` | |
+| `IntegerType` | `pa.int32()` | |
+| `LongType` | `pa.int64()` | |
+| `FloatType` | `pa.float32()` | |
+| `DoubleType` | `pa.float64()` | |
+| `DecimalType(p, s)` | `pa.decimal128(p, s)` | |
+| `DateType` | `pa.date32()` | |
+| `TimeType` | `pa.time64("us")` | |
+| `TimestampType` | `pa.timestamp("us")` | |
+| `TimestampNanoType` | `pa.timestamp("ns")` | |
+| `TimestamptzType` | `pa.timestamp("us", tz="UTC")` | |
+| `TimestamptzNanoType` | `pa.timestamp("ns", tz="UTC")` | |
+| `StringType` | `pa.large_string()` | |
+| `UUIDType` | `pa.uuid()` | |
+| `BinaryType` | `pa.large_binary()` | |
+| `FixedType(L)` | `pa.binary(L)` | |
+| `StructType` | `pa.struct()` | |
+| `ListType(e)` | `pa.large_list(e)` | |
+| `MapType(k, v)` | `pa.map_(k, v)` | |
+| `UnknownType` | `pa.null()` | |
+
+---
+
+#### PyArrow to PyIceberg type mapping
+
+| PyArrow type | PyIceberg type class | Notes |
+|------------------------------------|-----------------------------|--------------------------------|
+| `pa.bool_()` | `BooleanType` | |
+| `pa.int32()` | `IntegerType` | |
+| `pa.int64()` | `LongType` | |
+| `pa.float32()` | `FloatType` | |
+| `pa.float64()` | `DoubleType` | |
+| `pa.decimal128(p, s)` | `DecimalType(p, s)` | |
+| `pa.decimal256(p, s)` | Unsupported | |
+| `pa.date32()` | `DateType` | |
+| `pa.date64()` | Unsupported | |
+| `pa.time64("us")` | `TimeType` | |
+| `pa.timestamp("us")` | `TimestampType` | |
+| `pa.timestamp("ns")` | `TimestampNanoType` | |
+| `pa.timestamp("us", tz="UTC")` | `TimestamptzType` | |
+| `pa.timestamp("ns", tz="UTC")` | `TimestamptzNanoType` | |

Review Comment:
   I would add a note to both the `us` and `ns` timezone rows that only the timezones in `UTC_ALIASES` are supported. Also, I think `ns` is `format_version=3` only?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
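
A minimal sketch of the two constraints raised in the review comment, written in plain Python so it runs without PyArrow or PyIceberg installed. Everything here is an assumption for illustration: `iceberg_timestamp_type` is a hypothetical helper, not a PyIceberg function; the alias set is believed to mirror PyIceberg's `UTC_ALIASES` but should be checked against the installed version; and the `format_version` rule follows the reviewer's tentative "I think". The sketch covers only the four timestamp rows of the table and ignores other units and any downcasting options.

```python
# Assumption: believed to match PyIceberg's UTC_ALIASES; verify before relying on it.
UTC_ALIASES = {"UTC", "+00:00", "Etc/UTC", "Z"}


def iceberg_timestamp_type(unit, tz=None, format_version=2):
    """Hypothetical helper: name the Iceberg type for pa.timestamp(unit, tz).

    Only models the four timestamp rows in the table above.
    """
    if unit not in ("us", "ns"):
        raise ValueError(f"unit {unit!r} is outside the rows covered by this sketch")

    # Reviewer's second point (assumed): nanosecond precision needs format version 3.
    if unit == "ns" and format_version < 3:
        raise ValueError("nanosecond timestamps assumed to require format_version >= 3")

    if tz is None:
        return "TimestampType" if unit == "us" else "TimestampNanoType"

    # Reviewer's first point: only UTC spellings map to the tz-aware types.
    if tz not in UTC_ALIASES:
        raise ValueError(f"timezone {tz!r} is not one of {sorted(UTC_ALIASES)}")

    return "TimestamptzType" if unit == "us" else "TimestamptzNanoType"
```

If these assumptions hold, the Notes column for the two `tz="UTC"` rows could say that any spelling in `UTC_ALIASES` is accepted, and the two `ns` rows could mention the format version requirement.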
