viirya opened a new pull request, #2612:
URL: https://github.com/apache/iceberg-rust/pull/2612

   ## Which issue does this PR close?
   
   - Closes #2403.
   
   ## What changes are included in this PR?
   
   When a Parquet file lacks embedded field IDs but a name mapping 
(`schema.name-mapping.default`) is available, the reader applies the mapping to 
the Arrow schema — but it still planned **column projection** and **predicate 
pushdown** with the position-based fallback (field id `N` → column `N-1`). As a 
result:
   
   - Projected columns whose mapped field IDs don't line up with their physical 
positions were silently read as all-NULL.
   - Predicates were evaluated against the wrong physical columns, silently 
returning wrong rows (this also makes row-group pruning and row selection prune 
incorrectly).
   
   This contradicts Java's `ReadConf`, which uses a three-branch strategy: 
embedded IDs → field-id projection; name mapping → apply mapping, then field-id 
projection; neither → position fallback. Our code applied the name mapping to 
the Arrow schema but then took the fallback branch anyway.
   
   Changes:
   
   - `pipeline.rs`: compute `use_position_fallback = missing_field_ids && 
task.name_mapping.is_none()` and pass it (instead of `missing_field_ids`) to 
both `get_arrow_projection_mask` and `build_field_id_set_and_map`, so position 
fallback only applies when there are no embedded field IDs **and** no name 
mapping.
   - `projection.rs`: `build_field_id_set_and_map` now distinguishes the 
name-mapping case. When the Parquet descriptor has no embedded field IDs but a 
name mapping assigned IDs to the Arrow schema, it builds the field-id → 
leaf-column map from the Arrow schema's `PARQUET:field_id` metadata (new 
`build_field_id_map_from_arrow_schema`). Arrow leaf ordering matches Parquet 
leaf column ordering (both depth-first), the same invariant 
`get_arrow_projection_mask` already relies on for `ProjectionMask::leaves`.
   
   ## Are these changes tested?
   
   Two new regression tests in `projection.rs`, covering both broken paths:
   
   - `test_read_parquet_with_name_mapping_uses_mapped_field_ids`: a file 
without field IDs whose columns `[name, subdept]` map to non-contiguous field 
IDs (2, 4). On main, projection NULL-fills both columns; with this fix, the 
values are read correctly.
   - `test_predicate_on_name_mapped_file_uses_mapped_field_ids`: a predicate 
`name = "Alice"` on the same file shape, with row-group filtering and row 
selection enabled. On main, the predicate is evaluated against the wrong 
physical column and returns 2 rows; with this fix, it returns the single 
correct row.
   
   Both tests fail on main (verified by reverting the `pipeline.rs` strategy 
line) and pass with the fix. The full `arrow::` unit-test suite (97 tests) 
passes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to