kevinjqliu commented on code in PR #11660: URL: https://github.com/apache/iceberg/pull/11660#discussion_r1905905118
##########
format/spec.md:
##########
@@ -1633,3 +1633,57 @@ might indicate different snapshot IDs for a specific timestamp. The discrepancie
 When processing point in time queries implementations should use "snapshot-log" metadata to lookup the table state at the given point in time. This ensures time-travel queries reflect the state of the table at the provided timestamp. For example a SQL query like `SELECT * FROM prod.db.table TIMESTAMP AS OF '1986-10-26 01:21:00Z';` would find the snapshot of the Iceberg table just prior to '1986-10-26 01:21:00 UTC' in the snapshot logs and use the metadata from that snapshot to perform the scan of the table. If no snapshot exists prior to the timestamp given or "snapshot-log" is not populated (it is an optional field), then systems should raise an informative error message about the missing metadata.
+## Appendix G: Optional Snapshot Summary Fields
+
+### Metrics
+Snapshot summary can include metrics fields to track numeric stats of the snapshot. The value of these fields should be numeric strings (e.g., `"120"`).
+Some of them are also used to represent partition-level metrics, in [Partition-Level Summary](#partition-level-summary).
+Metrics must be accurate if written, as engines may rely on them for optimization.
+
+| Field | Description | Used in Partition-Level Summary |
+|-------------------------------------|-------------------------------------------------------------------------------------------------------|---------------------------------|
+| **`added-data-files`** | Number of data files added in the snapshot | Yes |
+| **`deleted-data-files`** | Number of data files deleted in the snapshot | Yes |
+| **`total-data-files`** | Total number of live data files in the snapshot | No |
+| **`added-delete-files`** | Number of positional/equality delete files and deletion vectors added in the snapshot | Yes |
+| **`added-equality-delete-files`** | Number of equality delete files added in the snapshot | Yes |
+| **`removed-equality-delete-files`** | Number of equality delete files removed in the snapshot | Yes |
+| **`added-position-delete-files`** | Number of position delete files added in the snapshot | Yes |
+| **`removed-position-delete-files`** | Number of position delete files removed in the snapshot | Yes |
+| **`added-dvs`** | Number of deletion vectors added in the snapshot | Yes |
+| **`removed-dvs`** | Number of deletion vectors removed in the snapshot | Yes |
+| **`removed-delete-files`** | Number of positional/equality delete files and deletion vectors removed in the snapshot | Yes |
+| **`total-delete-files`** | Total number of live positional/equality delete files and deletion vectors in the snapshot | No |
+| **`added-records`** | Number of records added in the snapshot | Yes |
+| **`deleted-records`** | Number of records deleted in the snapshot | Yes |
+| **`total-records`** | Total number of records in the snapshot | No |
+| **`added-files-size`** | The size of files added in the snapshot | Yes |
+| **`removed-files-size`** | The size of files removed in the snapshot | Yes |
+| **`total-files-size`** | The size of all files in the snapshot | No |
+| **`added-position-deletes`** | Number of position delete records added in the snapshot | Yes |
+| **`removed-position-deletes`** | Number of position delete records removed in the snapshot | Yes |
+| **`total-position-deletes`** | Total number of position delete records in the snapshot | No |
+| **`added-equality-deletes`** | Number of equality delete records added in the snapshot | Yes |
+| **`removed-equality-deletes`** | Number of equality delete records removed in the snapshot | Yes |
+| **`total-equality-deletes`** | Total number of equality delete records in the snapshot | No |
+| **`deleted-duplicate-files`** | Number of duplicate files deleted, where duplicates are files recorded more than once in the manifest | No |

Review Comment:
```suggestion
| **`deleted-duplicate-files`** | Number of duplicate files deleted (duplicates are files recorded more than once in the manifest) | No |
```

##########
format/spec.md:
##########
@@ -1633,3 +1633,57 @@ might indicate different snapshot IDs for a specific timestamp. The discrepancie
 When processing point in time queries implementations should use "snapshot-log" metadata to lookup the table state at the given point in time. This ensures time-travel queries reflect the state of the table at the provided timestamp. For example a SQL query like `SELECT * FROM prod.db.table TIMESTAMP AS OF '1986-10-26 01:21:00Z';` would find the snapshot of the Iceberg table just prior to '1986-10-26 01:21:00 UTC' in the snapshot logs and use the metadata from that snapshot to perform the scan of the table. If no snapshot exists prior to the timestamp given or "snapshot-log" is not populated (it is an optional field), then systems should raise an informative error message about the missing metadata.
+## Appendix G: Optional Snapshot Summary Fields
+
+### Metrics
+Snapshot summary can include metrics fields to track numeric stats of the snapshot. The value of these fields should be numeric strings (e.g., `"120"`).
+Some of them are also used to represent partition-level metrics, in [Partition-Level Summary](#partition-level-summary).
+Metrics must be accurate if written, as engines may rely on them for optimization.

Review Comment:
nit: what do you think about moving this description up one level, between "Appendix G" and "Metrics", since it describes generally what fields belong to the Snapshot Summary? It should apply to all subheaders ("Metrics", "Partition-Level", "Other").
```suggestion
## Appendix G: Optional Snapshot Summary Fields
Snapshot summary can include metrics fields to track numeric stats of the snapshot, see [Metrics](#metrics).
Some of them are also used to represent partition-level metrics, in [Partition-Level Summary](#partition-level-summary).
The value of these fields should be of string type (e.g., `"120"`).

### Metrics
Metrics must be accurate if written, as engines may rely on them for optimization.
```

##########
format/spec.md:
##########
@@ -1633,3 +1633,57 @@ might indicate different snapshot IDs for a specific timestamp. The discrepancie
 When processing point in time queries implementations should use "snapshot-log" metadata to lookup the table state at the given point in time. This ensures time-travel queries reflect the state of the table at the provided timestamp. For example a SQL query like `SELECT * FROM prod.db.table TIMESTAMP AS OF '1986-10-26 01:21:00Z';` would find the snapshot of the Iceberg table just prior to '1986-10-26 01:21:00 UTC' in the snapshot logs and use the metadata from that snapshot to perform the scan of the table. If no snapshot exists prior to the timestamp given or "snapshot-log" is not populated (it is an optional field), then systems should raise an informative error message about the missing metadata.
+## Appendix G: Optional Snapshot Summary Fields
+
+### Metrics
+Snapshot summary can include metrics fields to track numeric stats of the snapshot. The value of these fields should be numeric strings (e.g., `"120"`).
+Some of them are also used to represent partition-level metrics, in [Partition-Level Summary](#partition-level-summary).
+Metrics must be accurate if written, as engines may rely on them for optimization.
+
+| Field | Description | Used in Partition-Level Summary |
+|-------------------------------------|-------------------------------------------------------------------------------------------------------|---------------------------------|
+| **`added-data-files`** | Number of data files added in the snapshot | Yes |
+| **`deleted-data-files`** | Number of data files deleted in the snapshot | Yes |
+| **`total-data-files`** | Total number of live data files in the snapshot | No |
+| **`added-delete-files`** | Number of positional/equality delete files and deletion vectors added in the snapshot | Yes |
+| **`added-equality-delete-files`** | Number of equality delete files added in the snapshot | Yes |
+| **`removed-equality-delete-files`** | Number of equality delete files removed in the snapshot | Yes |
+| **`added-position-delete-files`** | Number of position delete files added in the snapshot | Yes |
+| **`removed-position-delete-files`** | Number of position delete files removed in the snapshot | Yes |
+| **`added-dvs`** | Number of deletion vectors added in the snapshot | Yes |
+| **`removed-dvs`** | Number of deletion vectors removed in the snapshot | Yes |
+| **`removed-delete-files`** | Number of positional/equality delete files and deletion vectors removed in the snapshot | Yes |
+| **`total-delete-files`** | Total number of live positional/equality delete files and deletion vectors in the snapshot | No |
+| **`added-records`** | Number of records added in the snapshot | Yes |
+| **`deleted-records`** | Number of records deleted in the snapshot | Yes |
+| **`total-records`** | Total number of records in the snapshot | No |
+| **`added-files-size`** | The size of files added in the snapshot | Yes |
+| **`removed-files-size`** | The size of files removed in the snapshot | Yes |
+| **`total-files-size`** | The size of all files in the snapshot | No |

Review Comment:
```suggestion
| **`total-files-size`** | Total size of files added in the snapshot | No |
```
align with other `total-` descriptions
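
As a side note on the numeric-string convention quoted in the hunks above (values such as `"120"`), here is a minimal sketch of how a reader of a snapshot summary might convert metric values to integers. The helper `parse_metrics` and the constant `METRIC_FIELDS` are hypothetical names for illustration, not part of any Iceberg library; the field names are a subset of the table in the diff, and the example summary values are made up.

```python
from typing import Dict

# Hypothetical subset of the metric field names listed in the proposed table.
METRIC_FIELDS = {
    "added-data-files",
    "deleted-data-files",
    "total-data-files",
    "added-records",
    "deleted-records",
    "total-records",
    "added-files-size",
    "removed-files-size",
    "total-files-size",
}


def parse_metrics(summary: Dict[str, str]) -> Dict[str, int]:
    """Return the known metric fields of a summary as integers.

    Assumes metric values are non-negative integers encoded as ASCII
    digit strings; anything else is rejected with ValueError.
    Non-metric keys (for example "operation") are ignored.
    """
    parsed: Dict[str, int] = {}
    for key, value in summary.items():
        if key not in METRIC_FIELDS:
            continue
        if not (isinstance(value, str) and value.isascii() and value.isdigit()):
            raise ValueError(f"{key} is not a numeric string: {value!r}")
        parsed[key] = int(value)
    return parsed


# Illustrative summary with invented values.
example_summary = {
    "operation": "append",
    "added-data-files": "3",
    "added-records": "120",
    "total-records": "120",
}
metrics = parse_metrics(example_summary)
```

Rejecting malformed values instead of skipping them is one possible design choice here; it follows from the quoted requirement that metrics be accurate if written, but a real reader could reasonably choose to ignore bad entries.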