[PR] feat(reader): position-based column projection for Parquet files without field IDs (migrated tables) [iceberg-rust]

via GitHub Tue, 21 Oct 2025 19:00:38 -0700


mbutrovich opened a new pull request, #1777:
URL: https://github.com/apache/iceberg-rust/pull/1777


   ## Which issue does this PR close?
   
   - N/A.
   
   ## Rationale for this change
   
   **Background**: This issue was discovered when running Iceberg Java's test 
suite against our [experimental DataFusion Comet branch that uses 
iceberg-rust](https://github.com/apache/datafusion-comet/pull/2528). Many 
failures occurred in `TestMigrateTableAction.java`, which tests reading Parquet 
files from migrated tables (_e.g.,_ from Hive or Spark) that lack embedded 
field ID metadata.
   
   **Problem**: The Rust ArrowReader was unable to read these files, while 
Iceberg Java handles them using a position-based fallback where top-level field 
ID N maps to top-level Parquet column position N-1, and entire columns 
(including nested content) are projected.
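
As a minimal sketch of the mapping rule described above (not the code in this PR), the position lookup can be written as a small pure function. The name `fallback_position` and its signature are hypothetical and chosen only for illustration; the assumption is that a field ID with no matching column position is reported as absent rather than treated as an error.

```rust
/// Hypothetical helper (not part of iceberg-rust): maps a top-level Iceberg
/// field ID to a 0-based top-level Parquet column position.
/// Field IDs are 1-indexed, so field ID N maps to position N - 1.
/// Returns None when the file has no column at that position.
fn fallback_position(field_id: i32, num_top_level_columns: usize) -> Option<usize> {
    if field_id < 1 {
        // IDs below 1 have no corresponding position under this rule.
        return None;
    }
    let position = (field_id - 1) as usize;
    // Only positions that exist in the file are valid.
    (position < num_top_level_columns).then_some(position)
}
```

For a file with three top-level columns, field ID 1 resolves to position 0 and field ID 4 resolves to no column at all.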
   
   
   ## What changes are included in this PR?
   
   This PR implements position-based column projection for Parquet files 
without field IDs, enabling iceberg-rust to read migrated tables.
   
     **Solution**: Implemented fallback projection in 
`ArrowReader::get_arrow_projection_mask_fallback()` that matches Java's
     `ParquetSchemaUtil.pruneColumnsFallback()` behavior:
     - Detects Parquet files without field IDs by checking Arrow schema metadata
     - Maps top-level field IDs to top-level column positions (field IDs are 
1-indexed, positions are 0-indexed)
     - Uses `ProjectionMask::roots()` to project entire columns including 
nested content (structs, lists, maps)
     - Adds field ID metadata to the projected schema for 
`RecordBatchTransformer`
     - Supports schema evolution by allowing missing columns (filled with 
default values by `RecordBatchTransformer`)
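
The steps in the list above can be sketched with a simplified, standard-library-only model. This is an illustration under stated assumptions, not the implementation in this PR: `Field`, `lacks_field_ids`, and `project_by_position` are hypothetical names, `Field` is a stand-in for an Arrow field, and the metadata key `PARQUET:field_id` is assumed to be the key under which field IDs appear in Arrow schema metadata. In the real reader, the returned root positions would be the kind of input passed to `ProjectionMask::roots()`.

```rust
use std::collections::HashMap;

/// Metadata key assumed to carry the Parquet field ID on an Arrow field.
const FIELD_ID_KEY: &str = "PARQUET:field_id";

/// Simplified stand-in for an Arrow field: a name plus string metadata.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    metadata: HashMap<String, String>,
}

impl Field {
    fn new(name: &str) -> Self {
        Field { name: name.to_string(), metadata: HashMap::new() }
    }
}

/// Assumption: the fallback applies when no top-level field carries a field ID.
fn lacks_field_ids(fields: &[Field]) -> bool {
    fields.iter().all(|f| !f.metadata.contains_key(FIELD_ID_KEY))
}

/// Returns the top-level column positions to read and the projected fields,
/// each tagged with the field ID it was matched to.
fn project_by_position(fields: &[Field], requested_ids: &[i32]) -> (Vec<usize>, Vec<Field>) {
    let mut roots = Vec::new();
    let mut projected = Vec::new();
    for &id in requested_ids {
        if id < 1 {
            continue; // no position corresponds to IDs below 1
        }
        let position = (id - 1) as usize; // 1-indexed ID -> 0-indexed position
        if let Some(field) = fields.get(position) {
            let mut tagged = field.clone();
            tagged.metadata.insert(FIELD_ID_KEY.to_string(), id.to_string());
            roots.push(position);
            projected.push(tagged);
        }
        // IDs past the end of the file are skipped here; a later stage is
        // expected to fill those columns in.
    }
    (roots, projected)
}
```

Because only whole top-level columns are selected, a struct, list, or map column comes through with all of its nested content, and a requested ID with no matching column is simply left out of the projection.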
   
     This implementation now matches Iceberg Java's behavior for reading 
migrated tables, enabling interoperability with Java-based tooling and 
workflows.
   
   ## Are these changes tested?
   
     Yes, unit tests were added to cover the fallback path:
     - `test_read_parquet_file_without_field_ids` - Basic projection with 
primitive columns using position-based mapping
     - `test_read_parquet_without_field_ids_partial_projection` - Project 
subset of columns
     - `test_read_parquet_without_field_ids_schema_evolution` - Handle missing 
columns with NULL values
     - `test_read_parquet_without_field_ids_multiple_row_groups` - Verify 
behavior across row group boundaries
     - `test_read_parquet_without_field_ids_with_struct` - Project structs with 
nested fields (entire top-level column)
   
     All tests verify that behavior matches Iceberg Java's 
`pruneColumnsFallback()` implementation in
     `/parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


