stevenzwu commented on code in PR #11041: URL: https://github.com/apache/iceberg/pull/11041#discussion_r2783981016
##########
format/view-spec.md:
##########
@@ -160,6 +176,109 @@ Each entry in `version-log` is a struct with the following fields:
 | _required_ | `timestamp-ms` | Timestamp when the view's `current-version-id` was updated (ms from epoch) |
 | _required_ | `version-id` | ID that `current-version-id` was set to |

+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name | Description |
+|-------------|----------------|-------------|
+| _required_ | `namespace` | A list of strings for namespace levels |
+| _required_ | `name` | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that supplements the regular table metadata and is required for materialized views.
+The property "refresh-state" is set on the [snapshot summary](https://iceberg.apache.org/spec/#snapshots) property of every storage table snapshot to determine the freshness of the precomputed data of the storage table.
+
+| Requirement | Field name | Description |
+|-------------|-----------------|-------------|
+| _required_ | `refresh-state` | A [refresh state](#refresh-state) record stored as a JSON-encoded string |
+
+#### Freshness
+
+Consumers should only read from the storage table if the materialized view is "fresh" and therefore adequately represents the logical query definition of the view.
+Different systems define freshness differently based on time-based and logical factors.
+
+**Time-based freshness (consumer-defined):**
+
+Consumers may apply time-based freshness policies, such as allowing a certain staleness window based on `refresh-start-timestamp-ms`.
+When evaluating freshness, consumers:
+- Must first evaluate their own time-based freshness policy.
+- May additionally compare the `source-states` list against the states loaded from the catalog to verify the producers logical freshness policy.
+- May parse the view definition to implement a more sophisticated policy.
+- When a materialized view is considered stale, can fail, refresh inline, or treat the materialized view as a logical view.
+- Must not read from the storage table when the materialized view doesn't meet freshness criteria.
+
+**Logical freshness (producer-defined):**

Review Comment:
   I am not sure that we should call out consumer-defined and producer-defined. While the producer populates the `refresh-state`, it is still up to consumers to interpret it.

##########
format/view-spec.md:
##########
@@ -160,6 +176,109 @@ Each entry in `version-log` is a struct with the following fields:
 | _required_ | `timestamp-ms` | Timestamp when the view's `current-version-id` was updated (ms from epoch) |
 | _required_ | `version-id` | ID that `current-version-id` was set to |

+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name | Description |
+|-------------|----------------|-------------|
+| _required_ | `namespace` | A list of strings for namespace levels |
+| _required_ | `name` | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that supplements the regular table metadata and is required for materialized views.
+The property "refresh-state" is set on the [snapshot summary](https://iceberg.apache.org/spec/#snapshots) property of every storage table snapshot to determine the freshness of the precomputed data of the storage table.
+
+| Requirement | Field name | Description |
+|-------------|-----------------|-------------|
+| _required_ | `refresh-state` | A [refresh state](#refresh-state) record stored as a JSON-encoded string |
+
+#### Freshness
+
+Consumers should only read from the storage table if the materialized view is "fresh" and therefore adequately represents the logical query definition of the view.
+Different systems define freshness differently based on time-based and logical factors.
+
+**Time-based freshness (consumer-defined):**
+
+Consumers may apply time-based freshness policies, such as allowing a certain staleness window based on `refresh-start-timestamp-ms`.
+When evaluating freshness, consumers:
+- Must first evaluate their own time-based freshness policy.
+- May additionally compare the `source-states` list against the states loaded from the catalog to verify the producers logical freshness policy.
+- May parse the view definition to implement a more sophisticated policy.
+- When a materialized view is considered stale, can fail, refresh inline, or treat the materialized view as a logical view.
+- Must not read from the storage table when the materialized view doesn't meet freshness criteria.
+
+**Logical freshness (producer-defined):**
+
+Producers define the logical freshness policy and provide the necessary information in the [refresh state](#refresh-state) to verify the logical equivalence of the precomputed data with the query definition.
+Different producers may define different logical freshness policies, based on how much of the dependency graph must be current.
+Some require the entire query tree to be fully up to date, while others only require direct children or leaf nodes.
+When writing the refresh state, producers:
+- Must provide a sufficient list of source states so that consumers can determine freshness according to the producer's policy.
+- May leave the source states list empty if the source state cannot be determined for all objects (for example, for non-Iceberg tables).
+- Must store the entry with the oldest snapshot-id or version-id when the same source object appears multiple times in the dependency graph (for example, in diamond patterns).

Review Comment:
   "Diamond pattern" may not be a well-known term. We may need to explain the scenario.
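
For illustration only, the diamond scenario could be sketched as below: a materialized view reads two views that both read the same table, so that table is reached along two paths and may be seen at two different snapshots. This is a minimal sketch and not part of the proposed spec text. The names `SourceState`, `dedupe_source_states`, `db.events`, and `db.users` are hypothetical, and deciding "oldest" by commit timestamp is an assumption, since the quoted text does not say how age is determined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceState:
    """Hypothetical record of one source object as seen during a refresh."""
    name: str          # identifier of the source table
    snapshot_id: int   # snapshot that was read along one path of the graph
    timestamp_ms: int  # commit time of that snapshot (assumed ordering key)


def dedupe_source_states(states):
    """Keep one entry per source object, preferring the oldest snapshot."""
    oldest = {}
    for state in states:
        kept = oldest.get(state.name)
        if kept is None or state.timestamp_ms < kept.timestamp_ms:
            oldest[state.name] = state
    # Sort by name so the result is deterministic.
    return sorted(oldest.values(), key=lambda s: s.name)


# Diamond: mv -> view_a -> db.events and mv -> view_b -> db.events
states_seen = [
    SourceState("db.events", snapshot_id=101, timestamp_ms=1_000),  # via view_a
    SourceState("db.events", snapshot_id=102, timestamp_ms=2_000),  # via view_b
    SourceState("db.users", snapshot_id=7, timestamp_ms=1_500),
]
deduped = dedupe_source_states(states_seen)
```

In this sketch `db.events` appears twice in the input and once in the result, with the older of the two snapshots retained, which is the behavior the quoted bullet asks of producers.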
##########
format/view-spec.md:
##########
@@ -160,6 +176,109 @@ Each entry in `version-log` is a struct with the following fields:
 | _required_ | `timestamp-ms` | Timestamp when the view's `current-version-id` was updated (ms from epoch) |
 | _required_ | `version-id` | ID that `current-version-id` was set to |

+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name | Description |
+|-------------|----------------|-------------|
+| _required_ | `namespace` | A list of strings for namespace levels |
+| _required_ | `name` | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that supplements the regular table metadata and is required for materialized views.
+The property "refresh-state" is set on the [snapshot summary](https://iceberg.apache.org/spec/#snapshots) property of every storage table snapshot to determine the freshness of the precomputed data of the storage table.
+
+| Requirement | Field name | Description |
+|-------------|-----------------|-------------|
+| _required_ | `refresh-state` | A [refresh state](#refresh-state) record stored as a JSON-encoded string |
+
+#### Freshness
+
+Consumers should only read from the storage table if the materialized view is "fresh" and therefore adequately represents the logical query definition of the view.
+Different systems define freshness differently based on time-based and logical factors.
+
+**Time-based freshness (consumer-defined):**
+
+Consumers may apply time-based freshness policies, such as allowing a certain staleness window based on `refresh-start-timestamp-ms`.
+When evaluating freshness, consumers:
+- Must first evaluate their own time-based freshness policy.
+- May additionally compare the `source-states` list against the states loaded from the catalog to verify the producers logical freshness policy.

Review Comment:
   The last 4 bullet points aren't related to time-based freshness. They are independent of time-based vs. logical freshness.

##########
format/view-spec.md:
##########
@@ -160,6 +176,109 @@ Each entry in `version-log` is a struct with the following fields:
 | _required_ | `timestamp-ms` | Timestamp when the view's `current-version-id` was updated (ms from epoch) |
 | _required_ | `version-id` | ID that `current-version-id` was set to |

+#### Storage Table Identifier
+
+The table identifier for the storage table that stores the precomputed results.
+
+| Requirement | Field name | Description |
+|-------------|----------------|-------------|
+| _required_ | `namespace` | A list of strings for namespace levels |
+| _required_ | `name` | A string specifying the name of the table |
+
+### Storage table metadata
+
+This section describes additional metadata for the storage table that supplements the regular table metadata and is required for materialized views.
+The property "refresh-state" is set on the [snapshot summary](https://iceberg.apache.org/spec/#snapshots) property of every storage table snapshot to determine the freshness of the precomputed data of the storage table.
+
+| Requirement | Field name | Description |
+|-------------|-----------------|-------------|
+| _required_ | `refresh-state` | A [refresh state](#refresh-state) record stored as a JSON-encoded string |
+
+#### Freshness
+
+Consumers should only read from the storage table if the materialized view is "fresh" and therefore adequately represents the logical query definition of the view.
+Different systems define freshness differently based on time-based and logical factors.
+
+**Time-based freshness (consumer-defined):**
+
+Consumers may apply time-based freshness policies, such as allowing a certain staleness window based on `refresh-start-timestamp-ms`.
+When evaluating freshness, consumers:
+- Must first evaluate their own time-based freshness policy.
+- May additionally compare the `source-states` list against the states loaded from the catalog to verify the producers logical freshness policy.
+- May parse the view definition to implement a more sophisticated policy.
+- When a materialized view is considered stale, can fail, refresh inline, or treat the materialized view as a logical view.
+- Must not read from the storage table when the materialized view doesn't meet freshness criteria.

Review Comment:
   Some engines (like BigQuery) may combine the data from the storage table with the delta from the source tables, so it is not entirely correct to say `Must not read`. I'm not sure of the best wording here. Perhaps `Must not consume the storage table as it is`?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
