sdd opened a new issue, #478: URL: https://github.com/apache/iceberg-rust/issues/478
I've been working on creating a performance testing suite to measure the performance impact of the concurrent table scan work that I've been doing. I created a docker-compose file that uses the Tabular spark-iceberg container and MinIO to create an Iceberg table and insert data into it from the widely-used NYC Taxi dataset.

This is the Spark SQL that I used to create the table:

```SQL
CREATE DATABASE IF NOT EXISTS nyc.taxis;
DROP TABLE IF EXISTS nyc.taxis;
CREATE TABLE nyc.taxis (
    VendorID bigint,
    tpep_pickup_datetime timestamp,
    tpep_dropoff_datetime timestamp,
    passenger_count double,
    trip_distance double,
    RatecodeID double,
    store_and_fwd_flag string,
    PULocationID bigint,
    DOLocationID bigint,
    payment_type bigint,
    fare_amount double,
    extra double,
    mta_tax double,
    tip_amount double,
    tolls_amount double,
    improvement_surcharge double,
    total_amount double,
    congestion_surcharge double,
    airport_fee double
)
USING iceberg
PARTITIONED BY (days(tpep_pickup_datetime));
```

Note the partition on a timestamp column with a transform of `Day`. When it inserts data, the reference Java Iceberg implementation writes the Avro manifest files using an Avro type of Date for the partition struct value.
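
To make the partition value concrete, here is a minimal sketch of what a day transform produces for a timestamp. It assumes the timestamp is stored as microseconds since the Unix epoch and that the transform floors toward negative infinity. The helper name `day_ordinal` is hypothetical and is not part of iceberg-rust. The result is a count of days since 1970-01-01, which is the same quantity that Avro's `date` logical type annotates on an `int`.

```rust
/// Microseconds in one day.
const MICROS_PER_DAY: i64 = 86_400_000_000;

/// Hypothetical helper: day ordinal (days since 1970-01-01) for a timestamp
/// given as microseconds since the Unix epoch. `div_euclid` floors for a
/// positive divisor, so pre-epoch timestamps land on the previous day.
fn day_ordinal(timestamp_micros: i64) -> i32 {
    timestamp_micros.div_euclid(MICROS_PER_DAY) as i32
}

fn main() {
    // 2024-01-01T00:00:00Z is 1_704_067_200 seconds after the epoch.
    let pickup_micros: i64 = 1_704_067_200 * 1_000_000;
    println!("partition value: {}", day_ordinal(pickup_micros));
}
```

Because the value is already a day count, treating it as a date rather than a bare integer changes only the declared type, not the stored number.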

Iceberg-rust's `PartitionSpec.partition_type()` method calls `partition_field.transform.result_type` to determine the type of each field when constructing a `StructType` for the table's partition schema:

https://github.com/apache/iceberg-rust/blob/244a218d4000ea4bae7795f4f8845537cd110e07/crates/iceberg/src/spec/partition.rs#L90

For `Transform::Day`, `Transform`'s `result_type` method returns `PrimitiveType::Int` as the result type:

https://github.com/apache/iceberg-rust/blob/244a218d4000ea4bae7795f4f8845537cd110e07/crates/iceberg/src/spec/transform.rs#L197-L208

**This is inconsistent with the Java implementation, and I think this should map to `PrimitiveType::Date`.**

As a consequence, when a file plan in iceberg-rust tries to parse a manifest file written by the Java implementation with a `Transform::Day` partition, `manifest_schema_v2()` in manifest.rs calls `schema_to_avro_schema()`, which visits the partition schema with `SchemaToAvroSchema`. This understandably transforms the `PrimitiveType::Int` into `AvroSchema::Int`:

https://github.com/apache/iceberg-rust/blob/244a218d4000ea4bae7795f4f8845537cd110e07/crates/iceberg/src/avro/schema.rs#L193-L199

This inconsistency between the schema passed to apache_avro's parser and the schema of the manifest file itself causes apache_avro to fail to parse the manifest.

I propose changing transform.rs line 202 to map to `PrimitiveType::Date`, as this fixes the problem:

https://github.com/apache/iceberg-rust/blob/244a218d4000ea4bae7795f4f8845537cd110e07/crates/iceberg/src/spec/transform.rs#L202

This necessitates changing the following tests, as they fail with the proposed change:

* spec::partition::tests::test_partition_type
* transform::temporal::test::test_day_transform
* transform::temporal::test::test_month_transform
* transform::temporal::test::test_year_transform

I'll create a PR containing the fix.

--
This is an automated message from the Apache Git Service.
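
The mismatch described above can be sketched with simplified stand-in types. The enums and functions below are illustrative only: they borrow names from iceberg-rust but are not the crate's real definitions or signatures. The sketch also assumes that only the `Day` arm changes, which the issue text does not settle for `Year` and `Month`. It shows how the declared result type decides which Avro schema the reader asks for.

```rust
// Simplified stand-ins for the iceberg-rust types named in the issue.
// Illustrative only; these are NOT the crate's real definitions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PrimitiveType {
    Int,
    Date,
    Timestamp,
    Timestamptz,
    String,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Transform {
    Year,
    Month,
    Day,
}

/// True for the source types that temporal transforms accept in this sketch.
fn is_temporal(input: PrimitiveType) -> bool {
    matches!(
        input,
        PrimitiveType::Date | PrimitiveType::Timestamp | PrimitiveType::Timestamptz
    )
}

/// Behaviour described in the issue: temporal transforms yield `Int`.
fn result_type_current(transform: Transform, input: PrimitiveType) -> Result<PrimitiveType, String> {
    if is_temporal(input) {
        Ok(PrimitiveType::Int)
    } else {
        Err(format!("{:?} is not valid for transform {:?}", input, transform))
    }
}

/// Proposed behaviour. Assumption: only the `Day` arm changes to `Date`.
fn result_type_proposed(transform: Transform, input: PrimitiveType) -> Result<PrimitiveType, String> {
    if !is_temporal(input) {
        return Err(format!("{:?} is not valid for transform {:?}", input, transform));
    }
    match transform {
        Transform::Day => Ok(PrimitiveType::Date),
        Transform::Year | Transform::Month => Ok(PrimitiveType::Int),
    }
}

/// Avro schema JSON for the two partition field types relevant here.
/// Avro encodes a date as an `int` annotated with the `date` logical type.
fn avro_schema_json(ty: PrimitiveType) -> Option<&'static str> {
    match ty {
        PrimitiveType::Int => Some(r#""int""#),
        PrimitiveType::Date => Some(r#"{"type":"int","logicalType":"date"}"#),
        _ => None, // other types are out of scope for this sketch
    }
}

fn main() {
    let source = PrimitiveType::Timestamp; // e.g. tpep_pickup_datetime
    let current = result_type_current(Transform::Day, source).unwrap();
    let proposed = result_type_proposed(Transform::Day, source).unwrap();
    println!("current:  {:?} -> {:?}", current, avro_schema_json(current));
    println!("proposed: {:?} -> {:?}", proposed, avro_schema_json(proposed));
}
```

With the current mapping the reader schema is a plain `int`, while a manifest written with a Date partition value declares the `date` logical type. With the proposed mapping the two schemas agree.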