bennychow commented on code in PR #11041: URL: https://github.com/apache/iceberg/pull/11041#discussion_r2794162281
########## format/view-spec.md: ########## @@ -160,6 +176,109 @@ Each entry in `version-log` is a struct with the following fields: | _required_ | `timestamp-ms` | Timestamp when the view's `current-version-id` was updated (ms from epoch) | | _required_ | `version-id` | ID that `current-version-id` was set to | +#### Storage Table Identifier + +The table identifier for the storage table that stores the precomputed results. + +| Requirement | Field name | Description | +|-------------|----------------|-------------| +| _required_ | `namespace` | A list of strings for namespace levels | +| _required_ | `name` | A string specifying the name of the table | + +### Storage table metadata + +This section describes additional metadata for the storage table that supplements the regular table metadata and is required for materialized views. +The property "refresh-state" is set on the [snapshot summary](https://iceberg.apache.org/spec/#snapshots) property of every storage table snapshot to determine the freshness of the precomputed data of the storage table. + +| Requirement | Field name | Description | +|-------------|-----------------|-------------| +| _required_ | `refresh-state` | A [refresh state](#refresh-state) record stored as a JSON-encoded string | + +#### Freshness + +A materialized view is "fresh" when the storage table adequately represents the logical query definition of the view. +Since different systems define freshness differently, it is left to the consumer to evaluate freshness based on its own policy. + +**Consumer behavior:** + +When evaluating freshness, consumers: +- May apply time-based freshness policies, such as allowing a staleness window based on `refresh-start-timestamp-ms`. +- May compare the `source-states` list against the states loaded from the catalog to verify the producer's freshness interpretation. +- May parse the view definition to implement more sophisticated policies. 
+- When a materialized view is considered stale, can fail, refresh inline, or treat the materialized view as a logical view. +- Should not consume the storage table as it is when the materialized view doesn't meet the freshness criteria. + +**Producer behavior:** + +Producers should provide the necessary information in the [refresh state](#refresh-state) such that consumers can verify the logical equivalence of the precomputed data with the query definition. +Different producers may have different freshness interpretations, based on how much of the dependency graph must be current. Review Comment: How about this: Different producers may have different freshness interpretations, based on how much of the refresh state's dependency graph should be evaluated. Some producers expect the entire dependency graph to be evaluated and therefore include nested MV dependencies. Other producers may only expect dependencies in the MV's SQL to be evaluated and therefore do not include dependencies within nested MVs. ########## format/view-spec.md: ########## @@ -42,12 +42,25 @@ An atomic swap of one view metadata file for another provides the basis for maki Writers create view metadata files optimistically, assuming that the current metadata location will not be changed before the writer's commit. Once a writer has created an update, it commits by swapping the view's metadata file pointer from the base location to the new location. +### Materialized Views + +Materialized views are a type of view with precomputed results from the view query stored as a table. +When queried, engines may return the precomputed data for the materialized views, shifting the cost of query execution to the precomputation step. + +Iceberg materialized views are implemented as a combination of an Iceberg view and an underlying Iceberg table, the "storage-table", which stores the precomputed data. +Materialized View metadata is a superset of View metadata with an additional pointer to the storage table. 
The storage table is an Iceberg table with additional materialized view refresh state metadata. +Refresh metadata contains information about the "source tables", "source views", and/or "source materialized views", which are the tables/views/materialized views referenced in the query definition of the materialized view. + ## Specification ### Terms * **Schema** -- Names and types of fields in a view. * **Version** -- The state of a view at some point in time. +* **Storage table** -- Iceberg table that stores the precomputed data of a materialized view. +* **Source table** -- A table reference that occurs in the query definition of a materialized view. Review Comment: Got it. That makes sense to me too. I still hope we can add "Refresh state" to this list. ########## format/view-spec.md: ########## @@ -160,6 +176,109 @@ Each entry in `version-log` is a struct with the following fields: | _required_ | `timestamp-ms` | Timestamp when the view's `current-version-id` was updated (ms from epoch) | | _required_ | `version-id` | ID that `current-version-id` was set to | +#### Storage Table Identifier + +The table identifier for the storage table that stores the precomputed results. + +| Requirement | Field name | Description | +|-------------|----------------|-------------| +| _required_ | `namespace` | A list of strings for namespace levels | +| _required_ | `name` | A string specifying the name of the table | + +### Storage table metadata + +This section describes additional metadata for the storage table that supplements the regular table metadata and is required for materialized views. +The property "refresh-state" is set on the [snapshot summary](https://iceberg.apache.org/spec/#snapshots) property of every storage table snapshot to determine the freshness of the precomputed data of the storage table. 
+ +| Requirement | Field name | Description | +|-------------|-----------------|-------------| +| _required_ | `refresh-state` | A [refresh state](#refresh-state) record stored as a JSON-encoded string | + +#### Freshness + +A materialized view is "fresh" when the storage table adequately represents the logical query definition of the view. +Since different systems define freshness differently, it is left to the consumer to evaluate freshness based on its own policy. + +**Consumer behavior:** + +When evaluating freshness, consumers: +- May apply time-based freshness policies, such as allowing a staleness window based on `refresh-start-timestamp-ms`. +- May compare the `source-states` list against the states loaded from the catalog to verify the producer's freshness interpretation. +- May parse the view definition to implement more sophisticated policies. +- When a materialized view is considered stale, can fail, refresh inline, or treat the materialized view as a logical view. +- Should not consume the storage table as it is when the materialized view doesn't meet the freshness criteria. + +**Producer behavior:** + +Producers should provide the necessary information in the [refresh state](#refresh-state) such that consumers can verify the logical equivalence of the precomputed data with the query definition. +Different producers may have different freshness interpretations, based on how much of the dependency graph must be current. +Some require the entire query tree to be fully up to date, while others only require direct children or leaf nodes. + +When writing the refresh state, producers: +- Should provide a sufficient list of source states such that consumers can determine freshness according to the producer's interpretation. +- May leave the source states list empty if the source state cannot be determined for all objects (for example, for non-Iceberg tables). 
+- Must store the entry with the oldest snapshot-id or version-id when the same source object is reachable through multiple paths in the dependency graph (diamond dependency pattern). + +#### Refresh state + +The refresh state record captures the dependencies in the materialized view's dependency graph. +These dependencies include source Iceberg tables, views, and nested materialized views. + +The refresh state has the following fields: + +| Requirement | Field name | Description | +|-------------|----------------|-------------| +| _required_ | `view-version-id` | The `version-id` of the materialized view when the refresh operation was performed | +| _required_ | `source-states` | A list of [source states](#source-state) records | +| _required_ | `refresh-start-timestamp-ms` | A timestamp of when the refresh operation was started | + +#### Source state + +Source state records capture the state of objects referenced by a materialized view. +Each record has a `type` field that determines its form: + +| Type | Description | +|---------|-------------| +| `table` | An Iceberg table, including storage tables of nested materialized views | Review Comment: Here's a suggestion for improving line 215: https://github.com/apache/iceberg/pull/11041/changes#r2794162281 For this section, how about we add this to the header: Before: Source state records capture the state of objects referenced by a materialized view. After: Source state records capture the state of objects referenced by a materialized view including objects referenced by nested materialized views. 
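To make the consumer-side discussion concrete, here is a minimal sketch of how a consumer might evaluate freshness from the `refresh-state` snapshot summary property. It is an illustration under stated assumptions, not the spec's or any engine's implementation: the function name `is_fresh`, the `name` key on each source state, the `current_states` mapping (standing in for states loaded from the catalog), and the "inside the staleness window OR all source states match" policy are all hypothetical. Only the field names `refresh-state`, `source-states`, `refresh-start-timestamp-ms`, `type`, `snapshot-id`, and `version-id` come from the quoted spec text.

```python
import json


def is_fresh(summary, current_states, now_ms, max_staleness_ms):
    """Apply a simple, hypothetical freshness policy to one storage table snapshot.

    summary: snapshot summary dict holding the JSON-encoded "refresh-state" string.
    current_states: assumed mapping of (type, name) -> current snapshot-id or version-id,
        standing in for whatever the consumer loaded from the catalog.
    """
    refresh_state = json.loads(summary["refresh-state"])

    # Time-based policy: accept anything refreshed within the staleness window.
    age_ms = now_ms - refresh_state["refresh-start-timestamp-ms"]
    if age_ms <= max_staleness_ms:
        return True

    # State-based policy: every recorded source state must match the catalog.
    source_states = refresh_state["source-states"]
    if not source_states:
        # The producer could not record source states, so nothing can be verified.
        return False
    for state in source_states:
        key = (state["type"], state["name"])  # "name" is an assumed identifier field
        recorded = state.get("snapshot-id", state.get("version-id"))
        if current_states.get(key) != recorded:
            return False
    return True


# Example input shaped like the refresh state described in the quoted hunk.
example_summary = {
    "refresh-state": json.dumps({
        "view-version-id": 1,
        "source-states": [{"type": "table", "name": "db.orders", "snapshot-id": 10}],
        "refresh-start-timestamp-ms": 1000,
    })
}
```

A consumer that gets `False` back would then pick one of the options listed in the hunk: fail, refresh inline, or treat the materialized view as a logical view.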
########## format/view-spec.md: ########## @@ -160,6 +176,109 @@ Each entry in `version-log` is a struct with the following fields: | _required_ | `timestamp-ms` | Timestamp when the view's `current-version-id` was updated (ms from epoch) | | _required_ | `version-id` | ID that `current-version-id` was set to | +#### Storage Table Identifier + +The table identifier for the storage table that stores the precomputed results. + +| Requirement | Field name | Description | +|-------------|----------------|-------------| +| _required_ | `namespace` | A list of strings for namespace levels | +| _required_ | `name` | A string specifying the name of the table | + +### Storage table metadata + +This section describes additional metadata for the storage table that supplements the regular table metadata and is required for materialized views. +The property "refresh-state" is set on the [snapshot summary](https://iceberg.apache.org/spec/#snapshots) property of every storage table snapshot to determine the freshness of the precomputed data of the storage table. + +| Requirement | Field name | Description | +|-------------|-----------------|-------------| +| _required_ | `refresh-state` | A [refresh state](#refresh-state) record stored as a JSON-encoded string | + +#### Freshness + +A materialized view is "fresh" when the storage table adequately represents the logical query definition of the view. +Since different systems define freshness differently, it is left to the consumer to evaluate freshness based on its own policy. + +**Consumer behavior:** + +When evaluating freshness, consumers: +- May apply time-based freshness policies, such as allowing a staleness window based on `refresh-start-timestamp-ms`. +- May compare the `source-states` list against the states loaded from the catalog to verify the producer's freshness interpretation. +- May parse the view definition to implement more sophisticated policies. 
+- When a materialized view is considered stale, can fail, refresh inline, or treat the materialized view as a logical view. +- Should not consume the storage table as it is when the materialized view doesn't meet the freshness criteria. + +**Producer behavior:** + +Producers should provide the necessary information in the [refresh state](#refresh-state) such that consumers can verify the logical equivalence of the precomputed data with the query definition. +Different producers may have different freshness interpretations, based on how much of the dependency graph must be current. +Some require the entire query tree to be fully up to date, while others only require direct children or leaf nodes. + +When writing the refresh state, producers: +- Should provide a sufficient list of source states such that consumers can determine freshness according to the producer's interpretation. +- May leave the source states list empty if the source state cannot be determined for all objects (for example, for non-Iceberg tables). +- Must store the entry with the oldest snapshot-id or version-id when the same source object is reachable through multiple paths in the dependency graph (diamond dependency pattern). + +#### Refresh state + +The refresh state record captures the dependencies in the materialized view's dependency graph. Review Comment: I'm suggesting that we define the terms "Refresh State" and "Dependency Graph" up front for the reader, since they are key to understanding the spec. So, under Specification -> Terms, here: https://github.com/apache/iceberg/pull/11041/changes#diff-4680d52dc70590abc27b56e8da794ddae43ef39ffdd0099b0d6d3802a12eb74fR60 add a new bullet for "Refresh State". This bullet would go below or above the Storage Table term.
