emkornfield commented on code in PR #14234: URL: https://github.com/apache/iceberg/pull/14234#discussion_r3268221874
##########
format/spec.md:
##########
@@ -704,11 +727,133 @@ Examples of valid field paths using normalized JSON path format are:
 * `$['tags']` -- the `tags` array
 * `$['addresses']['zip']` -- the `zip` field in an `addresses` array that contains objects

-For `geometry` and `geography` types, `lower_bounds` and `upper_bounds` are both points of the following coordinates X, Y, Z, and M (see Appendix G) which are the lower / upper bound of all objects in the file.
+##### Content Stats

-For `geography` only, xmin (X value of `lower_bounds`) may be greater than xmax (X value of `upper_bounds`), in which case an object in this bounding box may match if it contains an X such that x >= xmin OR x <= xmax. In geographic terminology, the concepts of xmin, xmax, ymin, and ymax are also known as westernmost, easternmost, southernmost and northernmost, respectively. These points are further restricted to the canonical ranges of [-180..180] for X and [-90..90] for Y.
+In Iceberg v4, statistics are stored in typed fields grouped in a struct that corresponds to the table field. These stats structs are nested within the `content_stats` struct in manifest files.

-When calculating upper and lower bounds for `geometry` and `geography`, null or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) contributes a value to X but no values to Y, Z, or M dimension bounds. If a dimension has only null or NaN values, that dimension is omitted from the bounding box. If either the X or Y dimension is missing then the bounding box itself is not produced.
+###### Field Statistics
+
+Field-level structs in `content_stats` are based on the corresponding table field's type, requirement, and ID (`field-id`).
+
+Field stats structs are assigned a range of 200 IDs, starting at `10_000 + 200 * field-id`. The first ID in the range (`base-id`) is the ID of the struct field in `content_stats`. Fields within the stats struct are assigned IDs from the range by adding an offset to the `base-id`. For example, the stats struct for table field 2 uses IDs in the range `[10_400, 10_599]`, the field within `content_stats` uses the `base-id`, ID `10_400`, and its `lower_bound` field (offset 1) uses ID `10_401`.
+
+Content stats must be resolved by ID; field names used for stats structs are informational. The recommended name for each field is the full name of the field in the table schema.
+
+IDs in the range `10_000` (inclusive) to `200_000_000` (exclusive) are reserved for column stats structs in `content_stats`. Stats for table fields with stats IDs outside that range cannot be stored in `content_stats`.
+
+[Reserved metadata fields](#reserved-field-ids) must use the stats ID ranges from the following table. Stats for metadata fields not in the table are not tracked.
+
+| Reserved field | ID | `base-id` | Range end |
+|---------------------------------|------------|-----------|-----------|
+| `_last_updated_sequence_number` | 2147483539 | 9000 | 9199 |
+| `_row_id` | 2147483540 | 9200 | 9399 |
+
+Each stats struct holds statistics for one table field. It may contain the following metrics:
+
+| Requirement | Offset | Name | Type | Included for | Description |
+|-------------|--------|---------------------------|---------------------------|-----------------------------------------------|-------------|
+| _optional_ | 1 | `lower_bound` | Field type or `geo_lower` | all primitives or `variant` | Lower bound stored as the field's type, or `geo_lower` for geo types |
+| _optional_ | 2 | `upper_bound` | Field type or `geo_upper` | all primitives or `variant` | Upper bound stored as the field's type, or `geo_upper` for geo types |
+| _optional_ | 3 | `tight_bounds` | `boolean` | all except `geometry`, `geography`, `variant` | When true, `lower_bound` and `upper_bound` must be equal to the min and max values |
+| _optional_ | 4 | `value_count` | `long` | all | Number of values in the column (including null and NaN values) |
+| _optional_ | 5 | `null_value_count` | `long` | optional fields | Number of null values in the column |
+| _optional_ | 6 | `nan_value_count` | `long` | `float`, `double` | Number of NaN values in the column |
+| _optional_ | 7 | `avg_value_size_in_bytes` | `int` | `string`, `binary`, `variant` | Avg value size (uncompressed) in bytes to estimate memory consumption |

Review Comment:
> This uses "uncompressed" to mean "unencoded and uncompressed". I don't see a reason to distinguish between encoding and compression here. Encoding and compression are distinct phases or layers within Parquet, but to Iceberg there are only two sizes: the size of values in memory (uncompressed) and the size the column takes on disk (compressed). I don't think that this is unclear, but we can state that this is intended to be a way to estimate the size values will actually take in memory -- that's what makes it useful.

I think this is unclear because of naming in Parquet, which has a field named [`total_uncompressed_size`](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L898) that is actually the encoded size, and naive implementations might just copy that field. The phrasing in Parquet for [`unencoded_byte_array_data_bytes`](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L204) might be a good place to draw inspiration from for a more precise definition, if it maps to what we are aiming for here.

"Size in memory" is actually pretty ambiguous because it depends on the memory model used to represent the values (e.g. Apache Arrow has three different representations, which can have significantly different memory profiles depending on data distribution; Java strings have quite a bit of overhead; etc.). Some memory models might actually use dictionary encoding, etc.

> However, I don't think this would work for shredded variants though

Yes, we would probably want the sum of the sizes of all shredded + unshredded values. It's not clear how to account for the "metadata field", but for consistency we probably just want to sum the total bytes taken for those as well?
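
To make the distinction concrete, here is a minimal Python sketch of one possible reading of "unencoded" size for `string` and `binary` values. It is loosely modeled on Parquet's `unencoded_byte_array_data_bytes`, which counts raw value bytes and excludes length prefixes. The helper names `unencoded_size` and `avg_value_size_in_bytes`, the choice to skip nulls, and the truncating integer division are all assumptions made for illustration; none of them is defined by the spec text in this PR.

```python
from typing import Iterable, Optional, Union

Value = Union[str, bytes, None]


def unencoded_size(value: Value) -> int:
    """Raw byte length of one value: no length prefix, no encoding, no compression."""
    if value is None:
        return 0  # assumption: nulls contribute no bytes
    if isinstance(value, str):
        return len(value.encode("utf-8"))  # strings are measured as UTF-8 bytes
    return len(value)


def avg_value_size_in_bytes(values: Iterable[Value]) -> Optional[int]:
    """Hypothetical helper: average unencoded size over the non-null values."""
    total = 0
    count = 0
    for v in values:
        if v is None:
            continue  # assumption: nulls are excluded from the average
        total += unencoded_size(v)
        count += 1
    if count == 0:
        return None  # nothing to average, so the optional stat is left unset
    return total // count  # assumption: truncate so the result fits an `int`


column = ["a", "bb", None, "héllo"]
print(avg_value_size_in_bytes(column))
```

Under this reading the result depends only on the values themselves. Dictionary encoding, page compression, and the in-memory layout chosen by a reader (Arrow, Java strings, and so on) would not change it, which is the property the comment above is asking the definition to pin down.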
