wmoustafa opened a new pull request, #9830: URL: https://github.com/apache/iceberg/pull/9830
## Spec

This patch adds support for materialized views in Iceberg and integrates the implementation with Spark SQL. It reuses the current spec of Iceberg views and tables by leveraging table properties to capture materialized view metadata. Those properties can be added to the Iceberg spec to formalize materialized view support. Below is a summary of all metadata properties introduced or utilized by this patch, classified based on whether they are associated with a table or a view, along with their purposes:

### Properties on a View

1. **`iceberg.materialized.view`**
   - **Type**: View property
   - **Purpose**: Marks whether a view is a materialized view. If set to `true`, the view is treated as a materialized view. This helps differentiate between virtual and materialized views within the catalog and dictates specific handling and validation logic for materialized views.

2. **`iceberg.materialized.view.storage.location`**
   - **Type**: View property
   - **Purpose**: Specifies the location of the storage table associated with the materialized view. This property links a materialized view with its corresponding storage table, enabling data management and query execution based on the stored data's freshness.

### Properties on a Table

1. **`base.snapshot.[UUID]`**
   - **Type**: Table property
   - **Purpose**: These properties store the snapshot IDs of the base tables at the time the materialized view's data was last updated. Each property name is the prefix `base.snapshot.` followed by the UUID of the base table. They are used to track whether the materialized view's data is up to date by comparing these snapshot IDs with the current snapshot IDs of the base tables. If all the base tables' current snapshot IDs match the ones stored in these properties, the materialized view's data is considered fresh.
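The freshness rule above can be sketched as a small, self-contained check. This is an illustrative sketch, not the PR's actual code: the class name, method names, and the representation of current snapshot IDs as a plain `Map` keyed by base-table UUID are assumptions; only the `base.snapshot.[UUID]` property convention comes from the description above.

```java
import java.util.Map;

/**
 * Sketch of the freshness check described above: a materialized view's data is
 * fresh only if, for every "base.snapshot.[UUID]" property on the storage
 * table, the recorded snapshot ID matches the base table's current snapshot ID.
 * (Hypothetical helper, not the PR's implementation.)
 */
public class FreshnessCheck {
  static final String BASE_SNAPSHOT_PREFIX = "base.snapshot.";

  /**
   * @param storageTableProps properties of the materialized view's storage table
   * @param currentSnapshots  current snapshot ID per base-table UUID
   */
  static boolean isFresh(Map<String, String> storageTableProps,
                         Map<String, Long> currentSnapshots) {
    for (Map.Entry<String, String> entry : storageTableProps.entrySet()) {
      if (!entry.getKey().startsWith(BASE_SNAPSHOT_PREFIX)) {
        continue; // unrelated table property
      }
      String baseTableUuid = entry.getKey().substring(BASE_SNAPSHOT_PREFIX.length());
      Long current = currentSnapshots.get(baseTableUuid);
      // Stale if the base table is unknown or its snapshot has advanced.
      if (current == null || current != Long.parseLong(entry.getValue())) {
        return false;
      }
    }
    return true;
  }
}
```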
## Spark SQL

This patch introduces support for materialized views in the Spark module by adding support for the Spark SQL `CREATE MATERIALIZED VIEW` command and by adding materialized view handling to the `DROP VIEW` DDL command. When a `CREATE MATERIALIZED VIEW` command is executed, the patch creates a new materialized view: it registers the view's metadata (including marking it as a materialized view with the appropriate properties), sets up a corresponding storage table to hold the materialized data, and records the base tables' current snapshot IDs at creation time. Conversely, when a `DROP VIEW` command is issued for a materialized view, the patch ensures that both the materialized view's metadata and its associated storage table are removed from the catalog. Support for `REFRESH MATERIALIZED VIEW` is left as a future enhancement.

## Spark Catalog

This patch enhances `SparkCatalog` to decide whether to return the view text metadata for a materialized view or the data from its associated storage table, based on the freshness of the materialized view. Within the `loadTable` method, the patch first checks whether the requested table corresponds to a materialized view by loading the view from the Iceberg catalog. If the loaded view is marked as a materialized view (via the `iceberg.materialized.view` property), the patch then assesses its freshness. If it is fresh, `loadTable` loads and returns the storage table associated with the materialized view, allowing users to query the pre-computed data directly. However, if the materialized view is stale, the method simply returns, allowing `SparkCatalog`'s `loadView` to run. In turn, `loadView` returns the metadata for the virtual view itself, triggering the usual Spark view logic that computes the result set from the current state of the base tables.
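The `loadTable`-versus-`loadView` dispatch can be condensed into a small decision function. This is a hedged sketch with hypothetical names: the class, the `resolveStorageTable` helper, and returning the storage location as a `String` are simplifications; only the two view property names and the fresh/stale branching come from the description above.

```java
import java.util.Map;

/**
 * Sketch of the dispatch logic described above (hypothetical types, not the
 * PR's SparkCatalog code): a fresh materialized view resolves to its storage
 * table; anything else returns null so that loadView serves the view
 * definition and Spark recomputes the result from the base tables.
 */
public class MaterializedViewDispatch {
  // Property names used by the patch, per the PR description.
  static final String MATERIALIZED_FLAG = "iceberg.materialized.view";
  static final String STORAGE_LOCATION = "iceberg.materialized.view.storage.location";

  /** Returns the storage table location to load, or null to fall back to loadView. */
  static String resolveStorageTable(Map<String, String> viewProps, boolean fresh) {
    if (!"true".equals(viewProps.get(MATERIALIZED_FLAG))) {
      return null; // common (virtual) view: let loadView handle it
    }
    return fresh ? viewProps.get(STORAGE_LOCATION) : null; // stale: recompute via loadView
  }
}
```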
## Storage Table API

This patch utilizes the `HadoopCatalog` to manage the storage table associated with each materialized view by referencing the table directly by its location. This approach hides the storage table from direct access or manipulation via the Spark SQL APIs, ensuring that the storage table remains an internal component of the materialized view implementation and maintaining the abstraction layer between the user-facing view definitions (namely, SQL) and the underlying catalog implementation.
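The hiding mechanism can be illustrated with a minimal sketch. Everything here is hypothetical (class and method names, the exception choice); the point it demonstrates comes from the description above: the storage table is reached only through the location recorded on the view, so it never needs an identifier in the SQL-facing catalog and is invisible to Spark SQL users.

```java
import java.util.Map;

/**
 * Illustrative sketch (not Iceberg's actual HadoopCatalog API): internal
 * location-based resolution of the storage table from the view's properties.
 * Because no identifier is registered in the SQL-facing catalog, SQL users
 * cannot list, query, or drop the storage table directly.
 */
public class StorageTableResolver {
  static final String STORAGE_LOCATION = "iceberg.materialized.view.storage.location";

  /** Internal resolution path used only by the materialized view machinery. */
  static String storageLocation(Map<String, String> viewProps) {
    String location = viewProps.get(STORAGE_LOCATION);
    if (location == null) {
      throw new IllegalStateException("Not a materialized view: missing storage location");
    }
    return location;
  }
}
```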