rdblue commented on code in PR #14234: URL: https://github.com/apache/iceberg/pull/14234#discussion_r3197600945
##########
format/spec.md:
##########
@@ -707,6 +714,131 @@ For `geography` only, xmin (X value of `lower_bounds`) may be greater than xmax
 When calculating upper and lower bounds for `geometry` and `geography`, null or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) contributes a value to X but no values to Y, Z, or M dimension bounds. If a dimension has only null or NaN values, that dimension is omitted from the bounding box. If either the X or Y dimension is missing then the bounding box itself is not produced.
 
+##### Content Stats
+
+In Iceberg v4 stats have been redesigned and are represented by using nested structs (`struct<struct<...>>`). The statistics for fields are tracked inside a nested struct of value counts and bounds (described in the next section). Each field-level statistics struct is a field of the `content_stats` struct, which holds all statistics for table fields.
+
+###### ID assignment for stats fields
+
+ID assignment follows a deterministic mapping from the **table ID space** to the **stats ID space**, where a given field ID from the **table ID space** gets an ID assigned from the **stats ID space** for each field-level statistics struct.
+Each field-level statistic listed in the [field stats types section](#field-stats-types) has a fixed offset. Its stats field ID is the enclosing stats struct's ID plus that offset.
+
+**Data columns (normal table field ids)**
+Mapping a table field ID from the **table ID space** to the **stats ID space** is done via:
+
+`stats_struct_id = 10_000 + (200 * table_field_id)`
+
+The constant `10_000` is `stats_space_field_id_start_for_data_fields`. `200` represents the number of supports stats per column (`num_supported_stats_per_column = 200`).
+
+The formula is defined as:
+`stats_struct_id = stats_space_field_id_start_for_data_fields + (num_supported_stats_per_column * table_field_id)`
+
+Each field statistic listed under [Field stats types](#field-stats-types) has a fixed **offset** within that block. The field id for an individual field statistic is:
+
+`stats_field_id = stats_struct_id + offset`
+
+**Metadata columns (reserved table field ids)**
+
+[Reserved metadata fields](#reserved-field-ids) use a different starting base for their stats field ids in order to not overlap with data field stats ids. Mapping a reserved table field ID to the **stats ID space** is done via:
+
+`stats_struct_id = 2_147_000_000 + (200 * (200 - (Integer.MAX_VALUE - table_field_id)))`
+
+Here `2_147_000_000` is `stats_space_field_id_start_for_metadata_fields`. This separate base is required because reserved ids are near `Integer.MAX_VALUE` and cannot use the same linear mapping as data field ids.
+The first `200` refers to `num_supported_stats_per_column = 200` and the second `200` refers to `num_reserved_field_ids = 200` from [Reserved field ids](#reserved-field-ids).
+
+The formula is defined as:
+`stats_struct_id = stats_space_field_id_start_for_metadata_fields + (num_supported_stats_per_column * (num_reserved_field_ids - (Integer.MAX_VALUE - table_field_id)))`
+
+Valid data field ids support stats structs with ids from `10_000` through `200_010_000`, so the highest supported **data** field id is `1_000_000`.

Review Comment:
I don't like that this puts an arbitrary limit on table field IDs that was not previously in the spec. It is easy to interpret this as "there can be no table field ID higher than 1,000,000". What we actually want to do is to allocate a range of stats field IDs that correspond to the first 1,000,000 (ish) fields. That way, we don't impose a new limit on the number of fields in a table. We just impose a limit on the fields that can be represented in the new stats structure.
To do that, I think we should change this to say that IDs in metadata files between 10,000 (inclusive) and 200,000,000 (exclusive) are reserved for column stats structs. If a table field has an ID that would be outside of that range, then it cannot store stats but is still valid.
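
To make the suggestion in the review comment concrete, here is a minimal Python sketch of the data-field mapping with a reserved range instead of a cap on table field IDs. The constants come from the quoted spec text and the comment; the function names, the `None` return for out-of-range fields, and the offset check are assumptions made for illustration, not part of the spec or the PR. Reserved metadata field IDs use a separate mapping and are not covered.

```python
from typing import Optional

# Values taken from the quoted spec text.
STATS_ID_START_DATA = 10_000       # stats_space_field_id_start_for_data_fields
NUM_STATS_PER_COLUMN = 200         # num_supported_stats_per_column
# Exclusive upper bound of the reserved range proposed in the review comment.
STATS_ID_END_DATA = 200_000_000


def stats_struct_id(table_field_id: int) -> Optional[int]:
    """Map a data field ID to its stats struct ID.

    Returns None when the result falls outside the reserved range, meaning
    the field is still valid but cannot store stats (hypothetical convention).
    """
    candidate = STATS_ID_START_DATA + NUM_STATS_PER_COLUMN * table_field_id
    if candidate < STATS_ID_START_DATA or candidate >= STATS_ID_END_DATA:
        return None
    return candidate


def stats_field_id(table_field_id: int, offset: int) -> Optional[int]:
    """Map a data field ID and a per-statistic offset to a stats field ID."""
    if not 0 <= offset < NUM_STATS_PER_COLUMN:
        raise ValueError("offset must fall inside the per-column block")
    struct_id = stats_struct_id(table_field_id)
    return None if struct_id is None else struct_id + offset
```

With these bounds the last data field ID that still maps inside the range is 999,949 (struct ID 199,999,800), which is consistent with the "1,000,000 (ish)" wording above. Higher field IDs simply map to `None` rather than being rejected.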
