sdd opened a new issue, #478:
URL: https://github.com/apache/iceberg-rust/issues/478

   I've been working on creating a performance testing suite to measure the 
performance impact of the concurrent table scan work that I've been doing. I 
created a docker-compose file that uses the Tabular spark-iceberg container and 
MinIO to create an Iceberg table and insert data into it from the widely used 
NYC Taxi dataset. This is the Spark SQL that I used to create the table:
   
   ```SQL
   CREATE DATABASE IF NOT EXISTS nyc;
   DROP TABLE IF EXISTS nyc.taxis;
   CREATE TABLE nyc.taxis (
       VendorID              bigint,
       tpep_pickup_datetime  timestamp,
       tpep_dropoff_datetime timestamp,
       passenger_count       double,
       trip_distance         double,
       RatecodeID            double,
       store_and_fwd_flag    string,
       PULocationID          bigint,
       DOLocationID          bigint,
       payment_type          bigint,
       fare_amount           double,
       extra                 double,
       mta_tax               double,
       tip_amount            double,
       tolls_amount          double,
       improvement_surcharge double,
       total_amount          double,
       congestion_surcharge  double,
       airport_fee           double
   )
   USING iceberg
   PARTITIONED BY (days(tpep_pickup_datetime));
   ```
   
   Note the partition on a timestamp column with a transform of `Day`.
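
For reference, a day transform reduces a timestamp to a count of whole days since 1970-01-01. A minimal sketch, assuming microsecond-precision timestamps; the helper name `day_ordinal` is hypothetical and is not an iceberg-rust function:

```rust
/// Microseconds in one day (24 * 60 * 60 * 1_000_000).
const MICROS_PER_DAY: i64 = 86_400_000_000;

/// Hypothetical helper: whole days since 1970-01-01 for a timestamp given
/// in microseconds since the epoch. `div_euclid` rounds towards negative
/// infinity, so timestamps before the epoch land on the preceding day.
pub fn day_ordinal(timestamp_micros: i64) -> i32 {
    timestamp_micros.div_euclid(MICROS_PER_DAY) as i32
}
```

The resulting value can be read either as a plain integer or as a calendar date, which is where the type question below comes from.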
   
   When Spark inserts data, the reference Java Iceberg implementation writes 
the Avro manifest files using an Avro type of Date for the partition struct 
value.
   
   Iceberg-rust's `PartitionSpec.partition_type()` method calls 
`partition_field.transform.result_type` in order to determine the type of 
fields when constructing a `StructType` for the table's partition schema:
   
   
https://github.com/apache/iceberg-rust/blob/244a218d4000ea4bae7795f4f8845537cd110e07/crates/iceberg/src/spec/partition.rs#L90
   
   `Transform`'s `result_type` method, for `Transform::Day`, returns 
`PrimitiveType::Int` as the result type:
   
   
https://github.com/apache/iceberg-rust/blob/244a218d4000ea4bae7795f4f8845537cd110e07/crates/iceberg/src/spec/transform.rs#L197-L208
   
   **This is inconsistent with the Java implementation, and I think this should 
map to `PrimitiveType::Date`.**
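
A minimal sketch of the proposed mapping. The enum and function here are hypothetical stand-ins written for illustration, not iceberg-rust's actual `PrimitiveType` or `Transform::result_type` definitions:

```rust
/// Hypothetical stand-in for a small subset of Iceberg primitive types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PrimitiveType {
    Int,
    Date,
    Timestamp,
    Timestamptz,
    String,
}

/// Result type of a day transform under the proposed mapping.
/// Returns `None` when the source type cannot be day-transformed.
pub fn day_result_type(source: PrimitiveType) -> Option<PrimitiveType> {
    match source {
        PrimitiveType::Date | PrimitiveType::Timestamp | PrimitiveType::Timestamptz => {
            // The behaviour described in this issue yields Int here;
            // the proposal is to yield Date instead.
            Some(PrimitiveType::Date)
        }
        _ => None,
    }
}
```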
   
   As a consequence, when a file plan in iceberg-rust tries to parse a manifest 
file written by the Java implementation with a `Transform::Day` partition, 
`manifest_schema_v2()` in manifest.rs calls `schema_to_avro_schema()`, which 
visits the partition schema with `SchemaToAvroSchema`. This understandably 
transforms the `PrimitiveType::Int` into `AvroSchema::Int`:
   
   
https://github.com/apache/iceberg-rust/blob/244a218d4000ea4bae7795f4f8845537cd110e07/crates/iceberg/src/avro/schema.rs#L193-L199
   
   This inconsistency between the schema passed to apache_avro's parser and the 
schema of the manifest file itself causes apache_avro to fail to parse the 
manifest.
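
To make the mismatch concrete, here is a sketch of the two Avro schema fragments involved. In Avro, `date` is a logical type annotating `int`, so the two declarations differ even though both are stored as an int. The enum and function names are hypothetical and are not part of iceberg-rust or apache_avro:

```rust
/// Hypothetical stand-in for the two candidate partition field types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PartitionFieldType {
    Int,
    Date,
}

/// Avro schema JSON for a partition field of the given type.
pub fn avro_schema_json(ty: PartitionFieldType) -> &'static str {
    match ty {
        // Plain Avro int: what a reader schema built from an Int type declares.
        PartitionFieldType::Int => r#""int""#,
        // Avro int annotated with the `date` logical type: what the
        // Java-written manifest declares, per the description above.
        PartitionFieldType::Date => r#"{"type":"int","logicalType":"date"}"#,
    }
}
```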
   
   I propose changing transform.rs line 202 to map to `PrimitiveType::Date`, as 
this fixes the problem:
   
   
https://github.com/apache/iceberg-rust/blob/244a218d4000ea4bae7795f4f8845537cd110e07/crates/iceberg/src/spec/transform.rs#L202
   
   This necessitates changing the following tests as they fail with the 
proposed change:
   
   * spec::partition::tests::test_partition_type
   * transform::temporal::test::test_day_transform
   * transform::temporal::test::test_month_transform
   * transform::temporal::test::test_year_transform
   
   I'll create a PR containing the fix.



