mbutrovich opened a new pull request, #1821:
URL: https://github.com/apache/iceberg-rust/pull/1821

   ## Which issue does this PR close?
   
   Partially addresses #1749.
   
   ## What changes are included in this PR?
   
   This PR adds partition spec handling to `FileScanTask` and 
`RecordBatchTransformer` to correctly implement the Iceberg spec's "Column 
Projection" rules for partition transforms.
   
   ### Problem Statement
   
   Prior to this PR, `iceberg-rust`'s `FileScanTask` had no mechanism for passing
partition information to `RecordBatchTransformer`, which left two behaviors
required by the Iceberg spec unimplemented:
   
   1. **Incorrect handling of non-identity partition transforms**: All 
partition transforms (bucket, truncate, year, etc.) were treated the same, 
violating the Iceberg spec requirement that only identity-transformed fields 
should use constants from partition metadata.
   
   2. **Field ID conflicts in add_files scenarios**: When importing Hive tables
via `add_files`, partition columns with `initial_default` values could have
field IDs that conflict with Parquet data column field IDs, requiring a
name-based mapping fallback.
   
   ### Iceberg Specification Requirements
   
   Per the Iceberg spec (format/spec.md "Column Projection" section):
   
   > "Return the value from partition metadata if an Identity Transform exists 
for the field and the partition value is present in the partition struct on 
data_file object in the manifest."
   
   **Why this matters:**
   - **Identity transforms** (e.g., `identity(dept)`) store actual column 
values in partition metadata that can be used as constants without reading the 
data file
   - **Non-identity transforms** (e.g., `bucket(4, id)`, `day(timestamp)`) 
store transformed values in partition metadata (e.g., bucket number 2, not the 
actual `id` values 100, 200, 300) and must read source columns from the data 
file
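
   To make the distinction concrete, here is a small, self-contained Rust sketch.
The toy `bucket` function stands in for Iceberg's real bucket transform (which
hashes the value with Murmur3 before taking the modulus); the types and values
are illustrative only, not the `iceberg-rust` API:

   ```rust
   // Toy stand-in for Iceberg's bucket transform (the real transform hashes the
   // value with Murmur3 before taking the modulus).
   fn bucket(n_buckets: i64, value: i64) -> i64 {
       value.rem_euclid(n_buckets)
   }

   fn main() {
       // One data file's rows, with the table partitioned by
       // identity(dept) and bucket(4, id).
       let dept = "eng";
       let ids = [100_i64, 200, 300];

       // What the manifest's partition struct records for this file:
       let partition_dept = dept;                   // identity: the actual value
       let partition_id_bucket = bucket(4, ids[0]); // bucket: only a bucket number

       // identity(dept): every row's dept really is "eng", so the reader can fill
       // the column with this constant without touching the file.
       assert_eq!(partition_dept, "eng");

       // bucket(4, id): the ids 100, 200, 300 all land in the same bucket, but the
       // stored bucket number says nothing about the ids themselves, so the reader
       // must read the `id` column from the data file.
       assert!(ids.iter().all(|&id| bucket(4, id) == partition_id_bucket));
       assert_ne!(partition_id_bucket, ids[0]);
   }
   ```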
   
   ### Changes Made
   
   1. **Added partition fields to `FileScanTask`** (`scan/task.rs`):
   - `partition: Option<Struct>` - Partition data from manifest entry
   - `partition_spec_id: Option<i32>` - Partition spec ID
   - `partition_spec: Option<Arc<PartitionSpec>>` - For transform-aware 
constant detection
   
   2. **Implemented `constants_map()` function**
(`arrow/record_batch_transformer.rs`):
   - Replicates Java's `PartitionUtil.constantsMap()` behavior
   - Only includes fields whose transform is `Transform::Identity`
   - Determines which fields are filled from partition metadata constants and
which are read from the data file (see the sketch after this list)
   
   3. **Enhanced `RecordBatchTransformer`**
(`arrow/record_batch_transformer.rs`):
   - Added `build_with_partition_data()` method to accept partition spec/data
   - Implements the spec's column resolution rules with identity-transform
awareness
   - Detects field ID conflicts (the add_files scenario) and falls back to
name-based mapping
   
   4. **Updated `ArrowReader`** (`arrow/reader.rs`):
   - Uses `build_with_partition_data()` when partition information is available
   - Falls back to `build()` when not available
   
   5. **Updated manifest entry processing** (`scan/context.rs`):
   - Populates partition fields in `FileScanTask` from manifest entry data
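
   A minimal sketch of the rule that `constants_map()` encodes, using simplified
stand-in types rather than the actual `iceberg-rust` signatures:

   ```rust
   use std::collections::HashMap;

   // Simplified stand-ins for illustration; not the iceberg-rust API.
   #[derive(PartialEq)]
   enum Transform {
       Identity,
       Bucket(u32),
   }

   struct PartitionField {
       source_field_id: i32,
       transform: Transform,
   }

   /// Keep only identity-transformed fields, keyed by source field id, per the
   /// spec's "Column Projection" rules (mirrors Java's PartitionUtil.constantsMap).
   fn constants_map<'a>(
       spec: &[PartitionField],
       partition_values: &[&'a str], // one value per partition field, in spec order
   ) -> HashMap<i32, &'a str> {
       spec.iter()
           .zip(partition_values)
           .filter(|(field, _)| field.transform == Transform::Identity)
           .map(|(field, value)| (field.source_field_id, *value))
           .collect()
   }

   fn main() {
       // Partition spec: identity(dept), bucket(4, id).
       let spec = vec![
           PartitionField { source_field_id: 2, transform: Transform::Identity },
           PartitionField { source_field_id: 1, transform: Transform::Bucket(4) },
       ];
       // Partition struct from the manifest entry: dept = "eng", id_bucket = "2".
       let constants = constants_map(&spec, &["eng", "2"]);

       // Only `dept` (source field id 2) becomes a constant; `id` must still be
       // read from the data file because its partition value is a bucket number.
       assert_eq!(constants.get(&2), Some(&"eng"));
       assert_eq!(constants.get(&1), None);
   }
   ```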
   
   ### Tests Added
   
   1. **`bucket_partitioning_reads_source_column_from_file`** - Verifies that 
bucket-partitioned source columns are read from data files (not treated as 
constants from partition metadata)
   
   2. **`identity_partition_uses_constant_from_metadata`** - Verifies that 
identity-transformed fields correctly use partition metadata constants
   
   3. **`add_files_partition_columns_with_field_id_conflict`** - Verifies 
correct name-based fallback when partition column field IDs conflict with 
Parquet data column field IDs
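
   For the add_files case, the kind of check the name-based fallback involves can
be sketched as follows. The `resolve_column` helper and its types are
hypothetical and for illustration only; the real resolution logic lives in
`RecordBatchTransformer`:

   ```rust
   use std::collections::HashMap;

   /// Hypothetical helper: decide where a requested column's values come from
   /// when an add_files import may have given a partition column a field id
   /// that collides with one of the file's data columns.
   fn resolve_column(
       field_id: i32,
       field_name: &str,
       parquet_columns: &HashMap<i32, String>,    // file's field id -> column name
       identity_constants: &HashMap<i32, String>, // output of constants_map()
   ) -> String {
       if let Some(name) = parquet_columns.get(&field_id) {
           if name == field_name {
               // Field id and name agree with a file column: the normal case.
               return format!("read `{field_name}` from file by field id");
           }
           // Field id collides with a *different* file column: ignore the id
           // match and fall back to name-based resolution below.
       }
       if let Some(value) = identity_constants.get(&field_id) {
           // Identity-partitioned column: take the constant from the manifest's
           // partition struct instead of misreading the conflicting file column.
           return format!("use partition constant {value:?} for `{field_name}`");
       }
       if parquet_columns.values().any(|n| n == field_name) {
           return format!("read `{field_name}` from file by name");
       }
       format!("fill `{field_name}` from initial_default / nulls")
   }

   fn main() {
       // Hive-written Parquet file: data columns id (field id 1) and name (field id 2).
       let parquet_columns = HashMap::from([(1, "id".to_string()), (2, "name".to_string())]);
       // The table's identity partition column `dept` was also assigned field id 1.
       let identity_constants = HashMap::from([(1, "eng".to_string())]);

       // Conflict: field id 1 is `id` in the file but `dept` in the table schema,
       // so `dept` must come from partition metadata, not the file.
       println!("{}", resolve_column(1, "dept", &parquet_columns, &identity_constants));
       // No conflict for `name`: read it from the file as usual.
       println!("{}", resolve_column(2, "name", &parquet_columns, &identity_constants));
   }
   ```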
   
   ## Are these changes tested?
   
   Yes, there are three new unit tests. These changes also resolved roughly 50
Iceberg Java test failures when run with DataFusion Comet's experimental PR
https://github.com/apache/datafusion-comet/pull/2528.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

