viirya opened a new pull request, #2612: URL: https://github.com/apache/iceberg-rust/pull/2612
## Which issue does this PR close? - Closes #2403. ## What changes are included in this PR? When a Parquet file lacks embedded field IDs but a name mapping (`schema.name-mapping.default`) is available, the reader applies the mapping to the Arrow schema — but it still planned **column projection** and **predicate pushdown** with the position-based fallback (field id `N` → column `N-1`). As a result: - Projected columns whose mapped field IDs don't line up with their physical positions were silently read as all-NULL. - Predicates were evaluated against the wrong physical columns, silently returning wrong rows (this also makes row-group pruning and row selection prune incorrectly). This contradicts Java's `ReadConf`, which uses a three-branch strategy: embedded IDs → field-id projection; name mapping → apply mapping, then field-id projection; neither → position fallback. Our code applied the name mapping to the Arrow schema but then took the fallback branch anyway. Changes: - `pipeline.rs`: compute `use_position_fallback = missing_field_ids && task.name_mapping.is_none()` and pass it (instead of `missing_field_ids`) to both `get_arrow_projection_mask` and `build_field_id_set_and_map`, so position fallback only applies when there are no embedded field IDs **and** no name mapping. - `projection.rs`: `build_field_id_set_and_map` now distinguishes the name-mapping case. When the Parquet descriptor has no embedded field IDs but a name mapping assigned IDs to the Arrow schema, it builds the field-id → leaf-column map from the Arrow schema's `PARQUET:field_id` metadata (new `build_field_id_map_from_arrow_schema`). Arrow leaf ordering matches Parquet leaf column ordering (both depth-first), the same invariant `get_arrow_projection_mask` already relies on for `ProjectionMask::leaves`. ## Are these changes tested? Two new regression tests in `projection.rs`, covering both broken paths: - `test_read_parquet_with_name_mapping_uses_mapped_field_ids`: a file without field IDs whose columns `[name, subdept]` map to non-contiguous field IDs (2, 4). On main, projection NULL-fills both columns; with this fix, the values are read correctly. - `test_predicate_on_name_mapped_file_uses_mapped_field_ids`: a predicate `name = "Alice"` on the same file shape, with row-group filtering and row selection enabled. On main, the predicate is evaluated against the wrong physical column and returns 2 rows; with this fix, it returns the single correct row. Both tests fail on main (verified by reverting the `pipeline.rs` strategy line) and pass with the fix. The full `arrow::` unit-test suite (97 tests) passes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
