nazq opened a new pull request, #2566: URL: https://github.com/apache/iceberg-rust/pull/2566
## Which issue does this PR close? - Closes #2565. ## What changes are included in this PR? `TableScanBuilder::build()` currently validates caller-supplied column names against `snapshot.schema(metadata)`, which returns the schema the snapshot was written under. For default scans (no explicit `snapshot_id`) on a table whose current schema differs from the snapshot's schema — i.e. any table that's been through an `UpdateSchemaAction` commit — this rejects columns that are valid against the post-evolution schema: ``` DataInvalid => Column note not found in table. Schema: <pre-evolution schema> ``` ### Fix [`crates/iceberg/src/scan/mod.rs:221`](https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/scan/mod.rs#L221): ```diff - let schema = snapshot.schema(self.table.metadata())?; + let schema = if self.snapshot_id.is_some() { + snapshot.schema(self.table.metadata())? + } else { + self.table.metadata().current_schema().clone() + }; ``` - **Explicit `snapshot_id`** (time-travel): keep the snapshot-time vocabulary. A caller asking for "the table as it existed at snapshot N" should see schema N's columns. - **Default scan** (no `snapshot_id`): use the table's current schema. Field IDs are stable across schemas, so the downstream Parquet projection (`get_arrow_projection_mask_with_field_ids`, which reads `PARQUET:field_id` metadata embedded by the writer) still finds the right on-disk columns even when the file's column names differ from the current schema's column names. This matches PyIceberg's approach in `pyiceberg/io/pyarrow.py::_task_to_record_batches` — project by field ID, rename the arrow batch on the way out. The column-name validation loop (lines 224–237) and the subsequent `field_id_by_name` lookup (lines 256–261) share the same `schema` variable, so the fix is one assignment with a 12-line comment explaining the invariant. ### Why this wasn't caught earlier `UpdateSchemaAction` (#2120) shipped with metadata-only tests in `crates/catalog/loader/tests/schema_update_suite.rs` — none of them exercise `table.scan().select_columns(...)` after the schema commit. The pre-existing `crates/integration_tests/tests/read_evolved_schema.rs` only uses `table.scan().build()` with no `select_columns`, which bypasses the validation loop entirely. So the regression has been latent in the add/delete path since #2120 merged; it's just easier to trip with rename. ## Are these changes tested? Yes — three regression tests in `scan::tests`, all on the same minimal fixture (`make_schema_evolved_table`) with `current-schema-id=1` (`id, value`) and a sole snapshot at `schema-id=0` (`id, tmp`): | Test | What it covers | |---|---| | `test_default_scan_uses_current_schema_after_evolution` | `select(["id", "value"])` succeeds in the default scan — the post-evolution column name resolves | | `test_default_scan_rejects_old_name_after_rename` | `select(["id", "tmp"])` fails with `DataInvalid` — the pre-evolution name is no longer part of the public vocabulary in a default scan | | `test_snapshot_id_scan_uses_snapshot_schema` | `snapshot_id(1).select(["id", "tmp"])` succeeds (time-travel keeps the old vocabulary); `snapshot_id(1).select(["id", "value"])` fails | All 34 existing `scan::tests` still pass. Full iceberg lib suite: 1299/1299. Clippy + rustfmt clean. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
