mbutrovich opened a new pull request, #1821: URL: https://github.com/apache/iceberg-rust/pull/1821
## Which issue does this PR close?

Partially addresses #1749.

## What changes are included in this PR?

This PR adds partition spec handling to `FileScanTask` and `RecordBatchTransformer` to correctly implement the Iceberg spec's "Column Projection" rules for partition transforms.

### Problem Statement

Prior to this PR, `FileScanTask` had no mechanism to pass partition information to `RecordBatchTransformer`, so `iceberg-rust` missed some behavior required by the Iceberg spec:

1. **Incorrect handling of non-identity partition transforms**: all partition transforms (bucket, truncate, year, etc.) were treated the same, violating the spec requirement that only identity-transformed fields should use constants from partition metadata.
2. **Field ID conflicts in `add_files` scenarios**: when importing Hive tables via `add_files`, partition columns with `initial_default` values could have field IDs that conflict with the field IDs of Parquet data columns, requiring a name-based mapping fallback.

### Iceberg Specification Requirements

Per the Iceberg spec (format/spec.md, "Column Projection" section):

> "Return the value from partition metadata if an Identity Transform exists for the field and the partition value is present in the partition struct on data_file object in the manifest."

**Why this matters:**

- **Identity transforms** (e.g., `identity(dept)`) store actual column values in partition metadata, which can be used as constants without reading the data file.
- **Non-identity transforms** (e.g., `bucket(4, id)`, `day(timestamp)`) store transformed values in partition metadata (e.g., bucket number 2, not the actual `id` values 100, 200, 300), so the source columns must be read from the data file.

### Changes Made

1. **Added partition fields to `FileScanTask`** (`scan/task.rs`):
   - `partition: Option<Struct>` - partition data from the manifest entry
   - `partition_spec_id: Option<i32>` - partition spec ID
   - `partition_spec: Option<Arc<PartitionSpec>>` - for transform-aware constant detection
2. **Implemented a `constants_map()` function** (`arrow/record_batch_transformer.rs`):
   - Replicates the behavior of Java's `PartitionUtil.constantsMap()`
   - Only includes fields whose transform is `Transform::Identity`
   - Used to determine which fields take constants from partition metadata versus being read from data files
3. **Enhanced `RecordBatchTransformer`** (`arrow/record_batch_transformer.rs`):
   - Added a `build_with_partition_data()` method that accepts the partition spec and data
   - Implements the spec's column resolution with identity-transform awareness
   - Detects field ID conflicts (the `add_files` scenario) and falls back to name-based mapping
4. **Updated `ArrowReader`** (`arrow/reader.rs`):
   - Uses `build_with_partition_data()` when partition information is available
   - Falls back to `build()` when it is not
5. **Updated manifest entry processing** (`scan/context.rs`):
   - Populates the partition fields in `FileScanTask` from manifest entry data

### Tests Added

1. **`bucket_partitioning_reads_source_column_from_file`** - verifies that bucket-partitioned source columns are read from data files rather than treated as constants from partition metadata
2. **`identity_partition_uses_constant_from_metadata`** - verifies that identity-transformed fields correctly use partition metadata constants
3. **`add_files_partition_columns_with_field_id_conflict`** - verifies the name-based fallback when partition column field IDs conflict with Parquet data column field IDs

## Are these changes tested?

Yes, there are 3 new unit tests. This change also resolved about 50 Iceberg Java tests when run with DataFusion Comet's experimental PR https://github.com/apache/datafusion-comet/pull/2528.
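The identity-only filtering behind `constants_map()` can be sketched as follows. This is a simplified illustration, not the crate's actual implementation: `Transform`, `PartitionField`, and the partition values are stand-in types here (the real `iceberg-rust` types are richer, and real partition values are not plain `i64`s).

```rust
use std::collections::HashMap;

// Simplified stand-ins for illustration only; the real crate's
// `Transform` and `PartitionSpec` types carry much more information.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Transform {
    Identity,
    Bucket(u32),
    Day,
}

// One partition field: which source column it derives from and how.
struct PartitionField {
    source_field_id: i32,
    transform: Transform,
}

// Mirrors the idea behind Java's `PartitionUtil.constantsMap()`:
// only identity-transformed fields yield usable constants, because
// non-identity transforms store derived values in partition metadata
// (e.g. a bucket number), not the source column's actual values.
fn constants_map(
    fields: &[PartitionField],
    partition_values: &[i64], // parallel to `fields`; simplified to i64
) -> HashMap<i32, i64> {
    fields
        .iter()
        .zip(partition_values)
        .filter(|(field, _)| field.transform == Transform::Identity)
        .map(|(field, value)| (field.source_field_id, *value))
        .collect()
}
```

Given a spec like `(bucket(4, id), identity(dept))`, the bucket field is excluded from the map, so `id` is read from the data file, while `dept` is served as a constant from partition metadata.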
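The per-column decision the transformer then makes can be sketched like this. It is a minimal sketch only: `ColumnSource` and `resolve_column` are hypothetical names, the real resolution also involves name-based fallback and `initial_default` handling, and partition values are simplified to `i64`.

```rust
use std::collections::{HashMap, HashSet};

// Where a projected column's values come from, following the spec's
// "Column Projection" resolution order (simplified for illustration).
#[derive(Debug, PartialEq)]
enum ColumnSource {
    PartitionConstant(i64), // identity-transform value from partition metadata
    DataFile,               // field ID is present in the Parquet file
    Missing,                // absent everywhere: null/default handling applies
}

// Hypothetical helper: `constants` is the output of an identity-only
// constants map; `file_field_ids` are the field IDs found in the file.
fn resolve_column(
    field_id: i32,
    constants: &HashMap<i32, i64>,
    file_field_ids: &HashSet<i32>,
) -> ColumnSource {
    if let Some(value) = constants.get(&field_id) {
        ColumnSource::PartitionConstant(*value)
    } else if file_field_ids.contains(&field_id) {
        ColumnSource::DataFile
    } else {
        ColumnSource::Missing
    }
}
```

The key point the PR's tests exercise is the first branch: a bucket-partitioned source column never appears in `constants`, so it falls through to `DataFile` and is read from Parquet rather than filled with the bucket number.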
