[GitHub] [iceberg] zhongyujiang commented on a diff in pull request #7105: Spec: Add partition stats spec

via GitHub Tue, 06 Jun 2023 18:59:12 -0700


zhongyujiang commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1220695681



##########
format/spec.md:
##########
@@ -702,6 +703,44 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are the valid files based on [Partition Statistics file format](#partition-statistics-file-format). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+It must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file format](#partition-statistics-file-format). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file format
+
+This is a specification for partition statistics files. It is designed to store statistics information
+for every partition value as a row in the **table default format** sorted based on the first partition column from `partition`.
+
+Partition statistics file store the statistics as a struct with the following fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | See [PartitionData](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/PartitionData.java) |

Review Comment:
   Thanks for explaining. I previously thought this was the same as
`PartitionData` in `DataFile`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


