flyrain commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1332135411
##########
format/spec.md:
##########
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |

+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file).
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation, and
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the partition statistics file. |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |

Review Comment:
   Not a native speaker, so I searched around. Seems `file count`, `record count` is the right way to go.

   > The reason "file count" is the correct phrase is because it follows the standard rules of English grammar for compound nouns. When you have a compound noun made up of two nouns, like "file" and "count," the first noun (in this case, "file") acts as an adjective describing the second noun (in this case, "count").

   > So, "file count" means the count of files, or in other words, it specifies what kind of count you are referring to – a count of files. This is a common construction in English, where the first noun helps specify or describe the second noun, and it's the reason "file count" is used rather than "files count."

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
