Re: [PR] Materialized View Spec [iceberg]

via GitHub Tue, 10 Feb 2026 23:19:43 -0800


bennychow commented on code in PR #11041:
URL: https://github.com/apache/iceberg/pull/11041#discussion_r2791033662



##########
format/spec.md:
##########
@@ -1846,3 +1846,7 @@ The Geometry and Geography class hierarchy and its 
Well-known text (WKT) and Wel
 Points are always defined by the coordinates X, Y, Z (optional), and M 
(optional), in this order. X is the longitude/easting, Y is the 
latitude/northing, and Z is usually the height, or elevation. M is a fourth 
optional dimension, for example a linear reference value (e.g., highway 
milepost value), a timestamp, or some other value as defined by the CRS.
 
 The version of the OGC standard first used here is 1.2.1, but future versions 
may also be used if the WKB representation remains wire-compatible.
+
+## Appendix H: Materialized Views
+
+Iceberg tables can be used as storage tables for [Iceberg Materialized 
Views](view-spec.md#materialized-views). The Materialized View specification is 
an extension of the [View Spec](view-spec.md) that defines how precomputed 
query results are stored and maintained using Iceberg tables as the underlying 
storage layer.

Review Comment:
   The spec doesn't cover MV maintenance, such as full vs. incremental 
refresh, so I don't think we should include "**and maintained**" here.



##########
format/view-spec.md:
##########
@@ -42,12 +42,25 @@ An atomic swap of one view metadata file for another 
provides the basis for maki
 
 Writers create view metadata files optimistically, assuming that the current 
metadata location will not be changed before the writer's commit. Once a writer 
has created an update, it commits by swapping the view's metadata file pointer 
from the base location to the new location.
 
+### Materialized Views
+
+Materialized views are a type of view with precomputed results from the view 
query stored as a table.
+When queried, engines may return the precomputed data for the materialized 
views, shifting the cost of query execution to the precomputation step.
+
+Iceberg materialized views are implemented as a combination of an Iceberg view 
and an underlying Iceberg table, the "storage-table", which stores the 
precomputed data.
+Materialized View metadata is a superset of View metadata with an additional 
pointer to the storage table. The storage table is an Iceberg table with 
additional materialized view refresh state metadata.
+Refresh metadata contains information about the "source tables", "source 
views", and/or "source materialized views", which are the 
tables/views/materialized views referenced in the query definition of the 
materialized view.
+
 ## Specification
 
 ### Terms
 
 * **Schema** -- Names and types of fields in a view.
 * **Version** -- The state of a view at some point in time.
+* **Storage table** -- Iceberg table that stores the precomputed data of a 
materialized view.
+* **Source table** -- A table reference that occurs in the query definition of 
a materialized view.

Review Comment:
   I feel this definition omits an important fact: the reference is to the 
table version at the time of refresh.
   
   If we adopt my suggestion here:  
https://github.com/apache/iceberg/pull/11041/changes#r2791741504
   
   then we can define source table as: A table reference that occurs in the 
refresh state of a materialized view.
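
To illustrate why the "at the time of refresh" part matters, here is a minimal sketch of a consumer comparing recorded source states against what the catalog reports now. The `name` and `snapshot-id` keys and the `current_snapshot_ids` mapping are assumptions made for illustration; the quoted hunks do not show the exact shape of a source state record.

```python
def stale_sources(recorded_states, current_snapshot_ids):
    """Return names of sources whose current snapshot differs from the recorded one.

    recorded_states: source state records captured at refresh time.
    current_snapshot_ids: hypothetical mapping of source name -> current snapshot id,
    as a consumer might load it from the catalog.
    """
    stale = []
    for state in recorded_states:
        name = state["name"]             # hypothetical key
        recorded = state["snapshot-id"]  # hypothetical key
        if current_snapshot_ids.get(name) != recorded:
            stale.append(name)
    return stale


# Source states as they were recorded when the storage table was refreshed.
recorded = [
    {"type": "table", "name": "db.orders", "snapshot-id": 10},
    {"type": "table", "name": "db.customers", "snapshot-id": 7},
]
# What the catalog reports now: db.orders has advanced since the refresh.
current = {"db.orders": 11, "db.customers": 7}
changed = stale_sources(recorded, current)
```

Under this reading, a source table is identified by the version pinned in the refresh state, not by whatever the query definition resolves to today.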
   
   



##########
format/view-spec.md:
##########
@@ -160,6 +176,109 @@ Each entry in `version-log` is a struct with the 
following fields:
 | _required_  | `timestamp-ms` | Timestamp when the view's 
`current-version-id` was updated (ms from epoch) |
 | _required_  | `version-id`   | ID that `current-version-id` was set to |
 
+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name     | Description |
+|-------------|----------------|-------------|
+| _required_  | `namespace`    | A list of strings for namespace levels |
+| _required_  | `name`         | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that 
supplements the regular table metadata and is required for materialized views.
+The property "refresh-state" is set on the [snapshot 
summary](https://iceberg.apache.org/spec/#snapshots) property of every storage 
table snapshot to determine the freshness of the precomputed data of the 
storage table.
+
+| Requirement | Field name      | Description |
+|-------------|-----------------|-------------|
+| _required_  | `refresh-state` | A [refresh state](#refresh-state) record 
stored as a JSON-encoded string |
+
+#### Freshness
+
+A materialized view is "fresh" when the storage table adequately represents 
the logical query definition of the view.
+Since different systems define freshness differently, it is left to the 
consumer to evaluate freshness based on its own policy.
+
+**Consumer behavior:**
+
+When evaluating freshness, consumers:
+- May apply time-based freshness policies, such as allowing a staleness window 
based on `refresh-start-timestamp-ms`.
+- May compare the `source-states` list against the states loaded from the 
catalog to verify the producer's freshness interpretation.
+- May parse the view definition to implement more sophisticated policies.
+- When a materialized view is considered stale, can fail, refresh inline, or 
treat the materialized view as a logical view.
+- Should not consume the storage table as it is when the materialized view 
doesn't meet the freshness criteria.
+
+**Producer behavior:**
+
+Producers should provide the necessary information in the [refresh 
state](#refresh-state) such that consumers can verify the logical equivalence 
of the precomputed data with the query definition.
+Different producers may have different freshness interpretations, based on how 
much of the dependency graph must be current.
+Some require the entire query tree to be fully up to date, while others only 
require direct children or leaf nodes.
+
+When writing the refresh state, producers:
+- Should provide a sufficient list of source states such that consumers can 
determine freshness according to the producer's interpretation.
+- May leave the source states list empty if the source state cannot be 
determined for all objects (for example, for non-Iceberg tables).
+- Must store the entry with the oldest snapshot-id or version-id when the same 
source object is reachable through multiple paths in the dependency graph 
(diamond dependency pattern).
+
+#### Refresh state
+
+The refresh state record captures the dependencies in the materialized view's 
dependency graph.

Review Comment:
   Here we define the refresh state and dependency graph, but we already refer 
to the dependency graph in the producer behavior.  Would it be better to add 
"**Refresh State**" as a new term under storage table and define it as 
follows:
   
   **Refresh State** -- Captures the dependencies in the materialized view's 
dependency graph at the time of refresh.  These dependencies could include 
source Iceberg tables, views, and nested materialized views.
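
For concreteness, a rough sketch of how such a record might be written to and read from the snapshot summary. The three field names come from the refresh state table in the diff. The helper names, the empty `source-states` list, and the timestamps are illustrative assumptions, not part of the spec.

```python
import json
import time


def encode_refresh_state(view_version_id, source_states, refresh_start_ms):
    """Serialize a refresh state record to a JSON-encoded string."""
    return json.dumps({
        "view-version-id": view_version_id,
        "source-states": source_states,
        "refresh-start-timestamp-ms": refresh_start_ms,
    })


def is_within_staleness_window(summary, max_staleness_ms, now_ms=None):
    """Time-based policy: fresh if the refresh started within the allowed window."""
    state = json.loads(summary["refresh-state"])
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return now_ms - state["refresh-start-timestamp-ms"] <= max_staleness_ms


# A snapshot summary carrying the refresh state as a JSON-encoded string.
summary = {"refresh-state": encode_refresh_state(3, [], 1_700_000_000_000)}
```

A consumer with a one-minute staleness window would call `is_within_staleness_window(summary, 60_000)`. This covers only the time-based policy from the consumer behavior list; comparing `source-states` against the catalog is a separate check.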



##########
format/view-spec.md:
##########
@@ -160,6 +176,109 @@ Each entry in `version-log` is a struct with the 
following fields:
 | _required_  | `timestamp-ms` | Timestamp when the view's 
`current-version-id` was updated (ms from epoch) |
 | _required_  | `version-id`   | ID that `current-version-id` was set to |
 
+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name     | Description |
+|-------------|----------------|-------------|
+| _required_  | `namespace`    | A list of strings for namespace levels |
+| _required_  | `name`         | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that 
supplements the regular table metadata and is required for materialized views.
+The property "refresh-state" is set on the [snapshot 
summary](https://iceberg.apache.org/spec/#snapshots) property of every storage 
table snapshot to determine the freshness of the precomputed data of the 
storage table.
+
+| Requirement | Field name      | Description |
+|-------------|-----------------|-------------|
+| _required_  | `refresh-state` | A [refresh state](#refresh-state) record 
stored as a JSON-encoded string |
+
+#### Freshness
+
+A materialized view is "fresh" when the storage table adequately represents 
the logical query definition of the view.
+Since different systems define freshness differently, it is left to the 
consumer to evaluate freshness based on its own policy.
+
+**Consumer behavior:**
+
+When evaluating freshness, consumers:
+- May apply time-based freshness policies, such as allowing a staleness window 
based on `refresh-start-timestamp-ms`.
+- May compare the `source-states` list against the states loaded from the 
catalog to verify the producer's freshness interpretation.
+- May parse the view definition to implement more sophisticated policies.
+- When a materialized view is considered stale, can fail, refresh inline, or 
treat the materialized view as a logical view.
+- Should not consume the storage table as it is when the materialized view 
doesn't meet the freshness criteria.
+
+**Producer behavior:**
+
+Producers should provide the necessary information in the [refresh 
state](#refresh-state) such that consumers can verify the logical equivalence 
of the precomputed data with the query definition.
+Different producers may have different freshness interpretations, based on how 
much of the dependency graph must be current.
+Some require the entire query tree to be fully up to date, while others only 
require direct children or leaf nodes.
+
+When writing the refresh state, producers:
+- Should provide a sufficient list of source states such that consumers can 
determine freshness according to the producer's interpretation.
+- May leave the source states list empty if the source state cannot be 
determined for all objects (for example, for non-Iceberg tables).
+- Must store the entry with the oldest snapshot-id or version-id when the same 
source object is reachable through multiple paths in the dependency graph 
(diamond dependency pattern).
+
+#### Refresh state
+
+The refresh state record captures the dependencies in the materialized view's 
dependency graph.
+These dependencies include source Iceberg tables, views, and nested 
materialized views.
+
+The refresh state has the following fields:
+
+| Requirement | Field name     | Description |
+|-------------|----------------|-------------|
+| _required_  | `view-version-id`         | The `version-id` of the 
materialized view when the refresh operation was performed  |
+| _required_  | `source-states`        | A list of [source 
states](#source-state) records |
+| _required_  | `refresh-start-timestamp-ms` | A timestamp of when the refresh 
operation was started |
+
+#### Source state
+
+Source state records capture the state of objects referenced by a materialized 
view.
+Each record has a `type` field that determines its form:
+
+| Type    | Description |
+|---------|-------------|
+| `table` | An Iceberg table, including storage tables of nested materialized 
views |

Review Comment:
   Are tables referenced from nested materialized views included in this list?  
I guess this is up to the producer, but we should state that clearly here.
   
   Also, "nested" MVs and "source" MVs are the same thing, right?  If so, can 
we use a consistent name?
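
To make the question concrete, here is a small sketch of the two producer interpretations I have in mind: recording only direct children versus walking through nested materialized views. The `graph` mapping and the object names are hypothetical and only illustrate the difference.

```python
def direct_sources(graph, mv):
    """Objects referenced directly in the materialized view's query definition."""
    return list(graph.get(mv, []))


def transitive_sources(graph, mv):
    """All objects reachable from the materialized view, each listed once."""
    seen = []
    stack = list(graph.get(mv, []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already reached through another path (diamond dependency)
        seen.append(node)
        stack.extend(graph.get(node, []))
    return sorted(seen)


# Hypothetical dependency graph: mv_top reads mv_nested and t_a,
# and mv_nested itself reads t_a and t_b.
graph = {
    "mv_top": ["mv_nested", "t_a"],
    "mv_nested": ["t_a", "t_b"],
}
direct = direct_sources(graph, "mv_top")
transitive = transitive_sources(graph, "mv_top")
```

With the first interpretation `t_b` never appears in the list, so a consumer cannot tell from the refresh state alone whether it was considered.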



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

