szehon-ho commented on issue #6420: URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1912878826
Hi @JanKaul . Thanks for putting this together. I went through the detailed discussion, and see the general consensus to the "Open Questions" in the design docs are: 1. The pointer to the storage table should be stored as an optional field in the view metadata (option 1) 2. Lineage information should be stored as additional fields in the summary of the storage table (option 2) 3. Only the view (and not storage table) should be registered in catalog (option 2) Is that correct? Then given that, I have summarized the additions we are making to the current metadata spec below. # View Metadata | v1 | v2 | Field Name | Description | |---|---|---|---| | optional | materialization | An optional `materialization` struct. If the value is null the entity is a common view, otherwise it is a materialized view | Materialization Struct | v1 | Field Name | Description | | -- | -- | -- | required | format-version | An integer version number for the materialized view format. Currently, this must be 1. Implementations must throw an exception if the materialized view's version is higher than the supported version. required | storage-table | Table metadata location | # Snapshot | v1 | Field Name | Description | | -- | -- | -- | | optional | refresh-version-id | Version id of the materialized view when the refresh operation was performed. optional | source-tables | A List of `source-table` records. | Source Table Struct | v1 | Field Name | Description | | -- | -- | -- | | required | identifier | Identifier of the table as defined in the SQL expression. | | required | snapshot-id | Snapshot id of the source table when the last refresh operation was performed. | Let me know if that looks right. My 2c on this are: 1. The materialization struct having its own format version seems overkill to me, maybe we can just flatten it and make the materialization directly just the storage-table pointer itself? 2. Similar to @jackye1995 on the comment above: https://github.com/apache/iceberg/issues/6420#issuecomment-1398572156, I feel having the list of source-tables is a bit difficult. Can we proceed without this in the first cut? I feel the engines, if they wanted, could parse the source tables, look them up, and get the snapshot-ids directly. They must to be able to parse the view-sql so should be able to parse that. 3. How about the 'refresh-strategy'? (Didnt see it in the google-doc). I feel it can in the current 'properties' field of view metadata. iiuc, @rdblue also had suggested putting this in the table properties of the storage table along with other fields like materialized_view_format_version and view_identifier, which sounds fine too. If there is general consensus on the direction, it'd be great to move to on the actual spec pr change and discuss specifics there, as seems like this proposal has been sitting awhile? I can also help with that, if needed. Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org