yyanyy commented on code in PR #11041: URL: https://github.com/apache/iceberg/pull/11041#discussion_r2082737494
########## format/view-spec.md: ########## @@ -160,6 +179,57 @@ Each entry in `version-log` is a struct with the following fields: | _required_ | `timestamp-ms` | Timestamp when the view's `current-version-id` was updated (ms from epoch) | | _required_ | `version-id` | ID that `current-version-id` was set to | +#### Storage Table Identifier + +The table identifier for the storage table that stores the precomputed results. + +| Requirement | Field name | Description | +|-------------|----------------|-------------| +| _optional_ | `catalog` | A string specifying the name of the catalog. If set to `null`, the catalog is the same as the view's catalog | +| _required_ | `namespace` | A list of strings for namespace levels | +| _required_ | `name` | A string specifying the name of the table/view | Review Comment: sorry I'm late to this discussion and this might be discussed/could be a stupid question - do we think if it's worth also include the snapshot id of the storage table, so that under circumstances where the storage table was updated but the materialized view was not or vice versa (race condition or commit partial succeeded), there's a reference link in both way (I know storage table already points to view version id under "refresh state"), which might be useful especially when view/table undergo version upgrades that could result in e.g. schema mismatch between the view and the underlying storage table? ########## format/view-spec.md: ########## @@ -160,6 +179,57 @@ Each entry in `version-log` is a struct with the following fields: | _required_ | `timestamp-ms` | Timestamp when the view's `current-version-id` was updated (ms from epoch) | | _required_ | `version-id` | ID that `current-version-id` was set to | +#### Storage Table Identifier + +The table identifier for the storage table that stores the precomputed results. + +| Requirement | Field name | Description | +|-------------|----------------|-------------| +| _optional_ | `catalog` | A string specifying the name of the catalog. If set to `null`, the catalog is the same as the view's catalog | Review Comment: +1 on avoid a different `catalog` field that may introduce engine interoperability challenges that I personally feel quite unnecessary, unless there's a strong use case ########## format/view-spec.md: ########## @@ -160,6 +179,57 @@ Each entry in `version-log` is a struct with the following fields: | _required_ | `timestamp-ms` | Timestamp when the view's `current-version-id` was updated (ms from epoch) | | _required_ | `version-id` | ID that `current-version-id` was set to | +#### Storage Table Identifier + +The table identifier for the storage table that stores the precomputed results. + +| Requirement | Field name | Description | +|-------------|----------------|-------------| +| _optional_ | `catalog` | A string specifying the name of the catalog. If set to `null`, the catalog is the same as the view's catalog | +| _required_ | `namespace` | A list of strings for namespace levels | +| _required_ | `name` | A string specifying the name of the table/view | + +### Storage table metadata + +This section describes additional metadata for the storage table that supplements the regular table metadata and is required for materialzied views. Review Comment: nit - typo in `materialzied` ########## format/view-spec.md: ########## @@ -160,6 +179,57 @@ Each entry in `version-log` is a struct with the following fields: | _required_ | `timestamp-ms` | Timestamp when the view's `current-version-id` was updated (ms from epoch) | | _required_ | `version-id` | ID that `current-version-id` was set to | +#### Storage Table Identifier + +The table identifier for the storage table that stores the precomputed results. + +| Requirement | Field name | Description | +|-------------|----------------|-------------| +| _optional_ | `catalog` | A string specifying the name of the catalog. If set to `null`, the catalog is the same as the view's catalog | +| _required_ | `namespace` | A list of strings for namespace levels | +| _required_ | `name` | A string specifying the name of the table/view | + +### Storage table metadata Review Comment: I wonder if it's still worth mentioning the fields that are only applicable to materialized view in appendix of table spec, e.g. https://iceberg.apache.org/spec/#optional-snapshot-summary-fields and link to this page, so it is still searchable in the table spec page ########## format/view-spec.md: ########## @@ -42,12 +42,28 @@ An atomic swap of one view metadata file for another provides the basis for maki Writers create view metadata files optimistically, assuming that the current metadata location will not be changed before the writer's commit. Once a writer has created an update, it commits by swapping the view's metadata file pointer from the base location to the new location. +### Materialized Views + +Materialized views are a type of view with precomputed results from the view query stored as a table. +When queried, engines may return the precomputed data for the materialized views, shifting the cost of query execution to the precomputation step. + +Iceberg materialized views are implemented as a combination of an Iceberg view and an underlying Iceberg table, known as the storage table, which stores the precomputed data. +The metadata for a materialized view extends the Iceberg view metadata, adding a pointer to the precomputed data and refresh information to determine if the data is still fresh. +The refresh information is composed of data about the so-called "source tables", which are the tables referenced in the query definition of the materialized view. +The storage table can be in the states of "fresh", "stale" or "invalid", which are determined from the following situations: +* **fresh** -- The `snapshot_id`s of the last refresh operation match the current `snapshot_id`s of the source tables. +* **stale** -- The `snapshot_id`s do not match, indicating that a refresh operation needs to be performed to capture the latest source table changes. Review Comment: it seems that from the rest of the spec, nested view is indeed a valid use case, for materialized view referring other materialized views or even common views. However it seems to me that the rest of the spec didn't mention one of these states being recorded anywhere in the spec, which I would also agree to not include since it feels very challenging to ensure this info could be trustworthy especially when there are multiple engines doing reads/writes on the same data. With this, do we mention these `fresh`/`stale`/`invalid` here mostly for establishing a nomenclature for view users? If so I wonder if it's worth explicitly mentioning that to avoid the user expecting one of these three states to be recorded in their view metadata. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org