wmoustafa opened a new pull request, #9830:
URL: https://github.com/apache/iceberg/pull/9830
## Summary
This PR adds support for materialized views in Iceberg and integrates the
implementation with Spark SQL.
## Spec
The full materialized view spec is in #11041. A materialized view is
an Iceberg view whose current version has a `storage-table` field: a struct
with `namespace` and `name` identifying the Iceberg table that holds the
precomputed results. Queries on the view are served from the storage table
as long as those results are "fresh".
Freshness is tracked through a `refresh-state` JSON string stored in the
storage table's snapshot summary. The refresh state captures:
- The view version ID at the time of refresh
- The state of each source table or view (snapshot ID, version ID, UUID)
- The refresh start timestamp
A materialized view is considered fresh when the view version ID and all
source snapshot/version IDs in the refresh state match their current values.
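The freshness rule above can be sketched roughly as follows. This is a minimal illustration, not the PR's `RefreshState` class: `FreshnessSketch`, `RecordedState`, and the choice to key sources by UUID are all assumptions made for the example.

```java
import java.util.Map;

// Hypothetical, simplified stand-in for the refresh state; not the PR's actual classes.
final class FreshnessSketch {
  // What was recorded at refresh time: the view version ID plus one
  // snapshot/version ID per source, keyed here (by assumption) on the source UUID.
  record RecordedState(int viewVersionId, Map<String, Long> sourceIds) {}

  // Fresh only if the view version and every source ID still match current values.
  static boolean isFresh(
      RecordedState recorded, int currentViewVersionId, Map<String, Long> currentSourceIds) {
    if (recorded == null) {
      return false; // no refresh state recorded, so nothing can be served as fresh
    }
    return recorded.viewVersionId() == currentViewVersionId
        && recorded.sourceIds().equals(currentSourceIds);
  }
}
```

Under these assumptions, a new snapshot on any source table, or a new view version, makes `isFresh` return `false`.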
## Core
New model classes and methods:
- `ViewVersion.storageTable()` — nullable `TableIdentifier` on the view
version; non-null indicates a materialized view
- `RefreshState` / `RefreshStateParser` — model and JSON serialization for
refresh state stored in snapshot summaries
- `SourceState` / `SourceTableState` / `SourceViewState` — polymorphic
source state model discriminated by a `type` field (`table` or `view`)
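As an illustration of a source state model discriminated by a `type` field, the sketch below dispatches on `table` versus `view`. It is not the PR's `RefreshStateParser`: the class and record names and the `uuid`, `snapshot-id`, and `version-id` field names are assumptions for the example, and a flat string map stands in for parsed JSON.

```java
import java.util.Map;

// Hypothetical sketch of type-discriminated source states; names are illustrative only.
final class SourceStateSketch {
  interface SourceState {
    String uuid();
  }

  record TableState(String uuid, long snapshotId) implements SourceState {}

  record ViewState(String uuid, int versionId) implements SourceState {}

  // Builds the matching subtype from already-parsed fields, using `type` as the discriminator.
  static SourceState fromFields(Map<String, String> fields) {
    String type = fields.get("type");
    if (type == null) {
      throw new IllegalArgumentException("Missing required field: type");
    }
    switch (type) {
      case "table":
        return new TableState(fields.get("uuid"), Long.parseLong(fields.get("snapshot-id")));
      case "view":
        return new ViewState(fields.get("uuid"), Integer.parseInt(fields.get("version-id")));
      default:
        throw new IllegalArgumentException("Unknown source state type: " + type);
    }
  }
}
```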
## Spark SQL
This PR adds support for `CREATE MATERIALIZED VIEW` and extends `DROP VIEW`
to handle materialized views:
- `CREATE MATERIALIZED VIEW` creates the storage table first, then
registers the view metadata with a `storage-table` reference on the view
version. The storage table identifier can be specified via a `STORED AS
'<identifier>'` clause; otherwise a default `<name>__storage` identifier is
used.
- `DROP VIEW` on a materialized view removes both the view metadata and
its associated storage table.
- `REFRESH MATERIALIZED VIEW` is left as a future enhancement.
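The storage table naming rule for `CREATE MATERIALIZED VIEW` can be sketched as a small helper. `StorageNameSketch` and `storageTableName` are hypothetical names, and the sketch handles only the name part of the identifier; how the PR resolves the namespace is not covered here.

```java
// Hypothetical helper illustrating the default storage table name; not code from the PR.
final class StorageNameSketch {
  // Returns the identifier given in STORED AS when present, else "<name>__storage".
  static String storageTableName(String viewName, String storedAs) {
    if (storedAs != null && !storedAs.isBlank()) {
      return storedAs;
    }
    return viewName + "__storage";
  }
}
```

For example, `storageTableName("daily_totals", null)` yields `daily_totals__storage`, while an explicit `STORED AS` value is returned unchanged.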
## Spark Catalog
The `SparkCatalog` determines whether to serve precomputed data from the
storage table or fall back to the view's SQL query:
- `loadTable()` checks if the requested identifier corresponds to a fresh
materialized view. If so, it returns a `SparkMaterializedView` backed by the
storage table, allowing queries to read the precomputed data directly.
- `loadView()` checks if the materialized view is fresh. If fresh, it
defers to `loadTable()`. If stale, it returns a `SparkView`, triggering the
usual Spark view logic that re-executes the query against the current state of
the source tables.
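The catalog's choice between the two read paths reduces to a small decision, sketched below. `ReadPathSketch`, `ReadPath`, and the two boolean inputs are hypothetical; in the PR this logic lives inside `SparkCatalog.loadTable()` and `loadView()` rather than in a standalone helper.

```java
// Hypothetical sketch of the read-path decision; not the SparkCatalog implementation.
final class ReadPathSketch {
  enum ReadPath {
    STORAGE_TABLE, // read precomputed results from the storage table
    VIEW_QUERY // re-execute the view's SQL against the current source tables
  }

  // Only a materialized view whose refresh state is fresh is served from storage.
  static ReadPath choose(boolean isMaterializedView, boolean isFresh) {
    return (isMaterializedView && isFresh) ? ReadPath.STORAGE_TABLE : ReadPath.VIEW_QUERY;
  }
}
```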
## Notes
- The `InMemoryCatalog` has been extended with a test `LocalFileIO` to
support data file operations required by the storage table.