xanderbailey opened a new pull request, #2153: URL: https://github.com/apache/iceberg-rust/pull/2153
## Which issue does this PR close?

- Closes https://github.com/apache/iceberg-rust/issues/2152

## What changes are included in this PR?

This PR adds incremental snapshot scanning support to iceberg-rust, similar to the Java client's `IncrementalDataTableScan`. This feature allows reading only the data files that were added between two snapshots, which is essential for:

- Change data capture (CDC) pipelines
- Incremental data processing
- Efficiently reading only new data since a checkpoint

### Core Iceberg Changes (`crates/iceberg/src/scan/`)

**New API on `TableScanBuilder`:**

```rust
// Scan changes between two snapshots (from exclusive, to inclusive)
table.scan()
    .from_snapshot_exclusive(from_id)
    .to_snapshot(to_id)
    .build()?;

// Scan changes with inclusive from
table.scan()
    .from_snapshot_inclusive(from_id)
    .to_snapshot(to_id)
    .build()?;

// Convenience methods (matching Java API)
table.scan().appends_after(from_id).build()?;
table.scan().appends_between(from_id, to_id).build()?;
```

**Implementation details:**

- Added `SnapshotRange` struct to validate snapshot ancestry and track snapshot IDs in range
- Modified `ManifestFileContext` to filter entries with `status=ADDED` and `snapshot_id` within range
- Validates that only APPEND operations are in the snapshot range (matches Java behavior)
- Returns clear error if from_snapshot is not an ancestor of to_snapshot

### DataFusion Integration (`crates/integrations/datafusion/`)

**New constructors on `IcebergStaticTableProvider`:**

```rust
// Scan changes between two snapshots
let provider = IcebergStaticTableProvider::try_new_incremental(table, from_id, to_id).await?;

// Scan with inclusive from
let provider = IcebergStaticTableProvider::try_new_incremental_inclusive(table, from_id, to_id).await?;

// Scan all appends after a snapshot to current
let provider = IcebergStaticTableProvider::try_new_appends_after(table, from_id).await?;

// Register with DataFusion and query
ctx.register_table("changes", Arc::new(provider))?;
let df = ctx.sql("SELECT * FROM changes WHERE category = 'A'").await?;
```

### Example Added (`crates/examples/`)

Added `datafusion_incremental_read.rs` example demonstrating:

- Incremental reads between two snapshots
- Using `appends_after()` for checkpoint-based processing
- Combining incremental reads with SQL filters

### Files Changed

| File | Changes |
|------|---------|
| `crates/iceberg/src/scan/mod.rs` | Added `SnapshotRange`, incremental scan methods on `TableScanBuilder` |
| `crates/iceberg/src/scan/context.rs` | Added `snapshot_range` to contexts, manifest entry filtering |
| `crates/integrations/datafusion/src/table/mod.rs` | Added incremental constructors to `IcebergStaticTableProvider` |
| `crates/integrations/datafusion/src/physical_plan/scan.rs` | Updated `IcebergTableScan` and `get_batch_stream()` |
| `crates/examples/src/datafusion_incremental_read.rs` | New example |
| `crates/examples/Cargo.toml` | Added datafusion dependencies and example entry |

## Are these changes tested?
Yes, this PR includes comprehensive tests:

### Core Iceberg Tests (`crates/iceberg/src/scan/mod.rs`)

- `test_incremental_scan_mutually_exclusive_with_snapshot_id` - Verifies snapshot_id and incremental options are mutually exclusive
- `test_incremental_scan_invalid_from_snapshot` - Verifies error when from is not ancestor of to
- `test_incremental_scan_invalid_to_snapshot` - Verifies error for non-existent to_snapshot
- `test_appends_after_convenience_method` - Tests the convenience method
- `test_appends_between_convenience_method` - Tests the convenience method
- `test_incremental_scan_from_snapshot_inclusive` - Tests inclusive from behavior
- `test_incremental_scan_from_snapshot_exclusive` - Tests exclusive from behavior

### DataFusion Integration Tests (`crates/integrations/datafusion/src/table/mod.rs`)

- `test_static_provider_incremental_creates_scan` - Verifies scan parameters are set correctly
- `test_static_provider_incremental_inclusive` - Tests inclusive flag
- `test_static_provider_appends_after` - Tests appends_after configuration
- `test_static_provider_incremental_invalid_snapshot` - Tests error handling

All existing tests continue to pass (41 scan tests, 8 static provider tests).
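
For readers unfamiliar with how an incremental append scan selects files, the ancestry validation and manifest entry filtering listed under the implementation details can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: the `Snapshot`, `Operation`, `EntryStatus`, and `ManifestEntry` types and the `snapshot_ids_in_range`, `is_in_incremental_scan`, and `demo_history` functions are all hypothetical stand-ins, and applying the APPEND check to an inclusive `from` snapshot is an assumption.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified snapshot operation (not the iceberg-rust type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Operation {
    Append,
    Overwrite,
}

/// Hypothetical, simplified snapshot metadata keyed by snapshot ID in a map.
struct Snapshot {
    parent_id: Option<i64>,
    operation: Operation,
}

/// Walks parent pointers from `to_id` back to `from_id` and collects the
/// snapshot IDs in the range. Fails if a snapshot is missing, if a snapshot
/// in the range is not an append, or if `from_id` is never reached.
fn snapshot_ids_in_range(
    snapshots: &HashMap<i64, Snapshot>,
    from_id: i64,
    from_inclusive: bool,
    to_id: i64,
) -> Result<HashSet<i64>, String> {
    let mut ids = HashSet::new();
    let mut current = Some(to_id);
    while let Some(id) = current {
        let snap = snapshots
            .get(&id)
            .ok_or_else(|| format!("snapshot {id} not found"))?;
        let is_from = id == from_id;
        // Assumption: an inclusive `from` snapshot must also be an append.
        if !is_from || from_inclusive {
            if snap.operation != Operation::Append {
                return Err(format!("snapshot {id} is not an append"));
            }
            ids.insert(id);
        }
        if is_from {
            return Ok(ids);
        }
        current = snap.parent_id;
    }
    Err(format!("snapshot {from_id} is not an ancestor of {to_id}"))
}

/// Hypothetical, simplified manifest entry status.
#[derive(Debug, PartialEq)]
enum EntryStatus {
    Existing,
    Added,
    Deleted,
}

/// Hypothetical, simplified manifest entry.
struct ManifestEntry {
    status: EntryStatus,
    snapshot_id: i64,
}

/// Keeps only entries that were added by a snapshot inside the range.
fn is_in_incremental_scan(entry: &ManifestEntry, ids: &HashSet<i64>) -> bool {
    entry.status == EntryStatus::Added && ids.contains(&entry.snapshot_id)
}

/// Builds a small linear history: 1 <- 2 <- 3 (appends) <- 4 (overwrite).
fn demo_history() -> HashMap<i64, Snapshot> {
    let mut history = HashMap::new();
    history.insert(1, Snapshot { parent_id: None, operation: Operation::Append });
    history.insert(2, Snapshot { parent_id: Some(1), operation: Operation::Append });
    history.insert(3, Snapshot { parent_id: Some(2), operation: Operation::Append });
    history.insert(4, Snapshot { parent_id: Some(3), operation: Operation::Overwrite });
    history
}
```

With the history from `demo_history()`, an exclusive range from snapshot 1 to snapshot 3 would select snapshots 2 and 3, while a range ending at snapshot 4 would be rejected because snapshot 4 is an overwrite.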
