nastra commented on code in PR #14234:
URL: https://github.com/apache/iceberg/pull/14234#discussion_r3199551217
##########
format/spec.md:
##########
@@ -707,6 +714,131 @@ For `geography` only, xmin (X value of `lower_bounds`)
may be greater than xmax
When calculating upper and lower bounds for `geometry` and `geography`, null
or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN)
contributes a value to X but no values to Y, Z, or M dimension bounds. If a
dimension has only null or NaN values, that dimension is omitted from the
bounding box. If either the X or Y dimension is missing then the bounding box
itself is not produced.
+##### Content Stats
+
+In Iceberg v4, stats have been redesigned and are represented using nested
structs (`struct<struct<...>>`). The statistics for each field are tracked inside a
nested struct of value counts and bounds (described in the next section). Each
field-level statistics struct is a field of the `content_stats` struct, which
holds all statistics for table fields.
+
+###### ID assignment for stats fields
+
+ID assignment follows a deterministic mapping from the **table ID space** to
the **stats ID space**, where a given field ID from the **table ID space** gets
an ID assigned from the **stats ID space** for each field-level statistics
struct.
+Each field-level statistic listed in the [field stats types
section](#field-stats-types) has a fixed offset. Its stats field ID is the
enclosing stats struct's ID plus that offset.
+
+**Data columns (normal table field ids)**
+Mapping a table field ID from the **table ID space** to the **stats ID space**
is done via:
+
+`stats_struct_id = 10_000 + (200 * table_field_id)`
+
+The constant `10_000` is `stats_space_field_id_start_for_data_fields`. `200`
represents the number of supported stats per column
(`num_supported_stats_per_column = 200`).
+
+The formula is defined as:
+`stats_struct_id = stats_space_field_id_start_for_data_fields +
(num_supported_stats_per_column * table_field_id)`
+
+Each field statistic listed under [Field stats types](#field-stats-types) has
a fixed **offset** within that block. The field id for an individual field
statistic is:
+
+`stats_field_id = stats_struct_id + offset`
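The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not Iceberg code: the function names are hypothetical, and only the constants come from the text.

```python
# Constants named after the spec text; the function names are hypothetical.
STATS_SPACE_FIELD_ID_START_FOR_DATA_FIELDS = 10_000
NUM_SUPPORTED_STATS_PER_COLUMN = 200


def stats_struct_id(table_field_id: int) -> int:
    """Map a data column's table field ID to the ID of its stats struct."""
    return (STATS_SPACE_FIELD_ID_START_FOR_DATA_FIELDS
            + NUM_SUPPORTED_STATS_PER_COLUMN * table_field_id)


def stats_field_id(table_field_id: int, offset: int) -> int:
    """ID of one statistic inside the stats struct, given its fixed offset."""
    return stats_struct_id(table_field_id) + offset


# Table field 5 with offset 6 (lower_bound) lands inside the block
# that starts at 11_000.
print(stats_struct_id(5), stats_field_id(5, 6))
```

Each column owns a block of 200 consecutive IDs in the stats ID space, so new statistics can be given offsets later without renumbering existing ones.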
+
+**Metadata columns (reserved table field ids)**
+
+[Reserved metadata fields](#reserved-field-ids) use a different starting base
for their stats field ids in order to not overlap with data field stats ids.
Mapping a reserved table field ID to the **stats ID space** is done via:
+
+`stats_struct_id = 2_147_000_000 + (200 * (200 - (Integer.MAX_VALUE -
table_field_id)))`
+
+Here `2_147_000_000` is `stats_space_field_id_start_for_metadata_fields`. This
separate base is required because reserved ids are near `Integer.MAX_VALUE` and
cannot use the same linear mapping as data field ids.
+The first `200` refers to `num_supported_stats_per_column = 200` and the
second `200` refers to `num_reserved_field_ids = 200` from [Reserved field
ids](#reserved-field-ids).
+
+The formula is defined as:
+`stats_struct_id = stats_space_field_id_start_for_metadata_fields +
(num_supported_stats_per_column * (num_reserved_field_ids - (Integer.MAX_VALUE
- table_field_id)))`
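As a sanity check, the reserved-ID mapping can be written out as below. This is a sketch under the constants stated above; the function name is hypothetical, and `INTEGER_MAX_VALUE` stands in for Java's `Integer.MAX_VALUE`.

```python
INTEGER_MAX_VALUE = 2_147_483_647  # value of Java's Integer.MAX_VALUE
STATS_SPACE_FIELD_ID_START_FOR_METADATA_FIELDS = 2_147_000_000
NUM_SUPPORTED_STATS_PER_COLUMN = 200
NUM_RESERVED_FIELD_IDS = 200


def metadata_stats_struct_id(table_field_id: int) -> int:
    """Map a reserved (metadata) table field ID to its stats struct ID."""
    # How far the reserved ID sits below Integer.MAX_VALUE.
    distance_from_max = INTEGER_MAX_VALUE - table_field_id
    # Zero-based position of the ID within the reserved range.
    position = NUM_RESERVED_FIELD_IDS - distance_from_max
    return (STATS_SPACE_FIELD_ID_START_FOR_METADATA_FIELDS
            + NUM_SUPPORTED_STATS_PER_COLUMN * position)


print(metadata_stats_struct_id(2_147_483_646))
```

Reusing the data-column formula for an ID this close to `Integer.MAX_VALUE` would produce a value far outside the 32-bit integer range, which is why the reserved range is first rebased to a small position and then placed at its own starting ID.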
+
+Stats struct ids for data fields range from `10_000` through
`200_010_000`, so the highest supported **data** field id is `1_000_000`.
+
+###### Name assignment for `content_stats` fields
+
+Each nested stats struct is a **child field** of the root `content_stats`
struct. Its **name** is the numerical string of the table column's field id
(for example id `103` uses the name `"103"`).
+Its **field id** is deterministically calculated as defined in the previous
section. The name is informational and readers must resolve content stats by ID.
+
+###### Field stats types
+
+Each stats struct holds statistics for one table column. It may contain the
following metrics:
+
+| required/optional | Offset | Name | Type | included for | Description |
+|-------------------|--------|-------------------------|---------------------|-------------------------|-------------|
+| _optional_ | 1 | value_count | `long` | all types | Number of values in the column (including null and NaN values) |
+| _optional_ | 2 | null_value_count | `long` | optional columns only | Number of null values in the column. Only included for optional columns |
+| _optional_ | 3 | nan_value_count | `long` | float/double types | Number of NaN values in the column. Only included for float/double types. NaN rules follow note 2 under [Data File Fields](#data-file-fields) |
+| _optional_ | 4 | avg_value_size_in_bytes | `int` | variable-length types | Avg stored (compressed, encoded) value size in bytes for variable-length types (`string` / `binary` / `variant`) |
+| _optional_ | 5 | max_value_size_in_bytes | `int` | variable-length types | Max stored (compressed, encoded) value size in bytes for variable-length types (`string` / `binary` / `variant`) |
+| _optional_ | 6 | lower_bound | type of table field | all types | Lower bound serialized as the column's type. Bounds follow rules defined in [Bounds for Variant, Geometry, and Geography](#bounds-for-variant-geometry-and-geography) |
+| _optional_ | 7 | upper_bound | type of table field | all types | Upper bound serialized as the column's type. Bounds follow rules defined in [Bounds for Variant, Geometry, and Geography](#bounds-for-variant-geometry-and-geography) |
+| _optional_ | 8 | exact_bounds | `boolean` | truncated/inexact types | Whether the `lower_bound` / `upper_bound` are exact (`true`) or may be truncated or otherwise inexact (`false`). Defaults to `true`. Types such as `string` / `binary` often use `false` when bounds are truncated. For types with inherently exact bounds when written (for example boolean, integer, floating-point, date, time, timestamp, decimal, uuid, geometry, geography), writers should use `true` when bounds are present. If a deletion vector or equality delete file can match rows in the data file, implementations must treat bounds as inexact for pruning (`exact_bounds` as `false`) |
+
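The "included for" column and the `exact_bounds` rules can be read as a small decision procedure. The sketch below is one possible reading, not the reference implementation: the enum, both function names, and the plain-string type names are hypothetical, and treating `exact_bounds` as applicable to every type that has bounds is an assumption drawn from the description rather than from the "included for" cell.

```python
from enum import IntEnum
from typing import List, Optional


class StatOffset(IntEnum):
    """Fixed offsets taken from the field stats table."""
    VALUE_COUNT = 1
    NULL_VALUE_COUNT = 2
    NAN_VALUE_COUNT = 3
    AVG_VALUE_SIZE_IN_BYTES = 4
    MAX_VALUE_SIZE_IN_BYTES = 5
    LOWER_BOUND = 6
    UPPER_BOUND = 7
    EXACT_BOUNDS = 8


FLOATING_POINT_TYPES = {"float", "double"}
VARIABLE_LENGTH_TYPES = {"string", "binary", "variant"}


def applicable_stats(type_name: str, is_optional: bool) -> List[StatOffset]:
    """Statistics that may appear for a column, ordered by offset."""
    stats = {
        StatOffset.VALUE_COUNT,
        StatOffset.LOWER_BOUND,
        StatOffset.UPPER_BOUND,
        # Assumption: exact_bounds may accompany bounds for any type.
        StatOffset.EXACT_BOUNDS,
    }
    if is_optional:
        stats.add(StatOffset.NULL_VALUE_COUNT)
    if type_name in FLOATING_POINT_TYPES:
        stats.add(StatOffset.NAN_VALUE_COUNT)
    if type_name in VARIABLE_LENGTH_TYPES:
        stats.add(StatOffset.AVG_VALUE_SIZE_IN_BYTES)
        stats.add(StatOffset.MAX_VALUE_SIZE_IN_BYTES)
    return sorted(stats)


def bounds_are_exact(exact_bounds: Optional[bool],
                     deletes_may_match: bool) -> bool:
    """Effective exactness of the bounds when pruning."""
    if deletes_may_match:
        # A deletion vector or equality delete may remove the bounding rows.
        return False
    # A missing exact_bounds value defaults to true.
    return True if exact_bounds is None else exact_bounds
```

For example, `applicable_stats("string", True)` yields every statistic except `nan_value_count`, and `bounds_are_exact(True, deletes_may_match=True)` is `False` regardless of the stored flag.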
+###### Stats projection
+
+To retrieve stats for a particular table field ID, readers always project by
stats ID. The stats ID for a table field ID comes from the mapping in [ID
assignment for stats fields](#id-assignment-for-stats-fields); the table field
ID for a given stats struct ID can be recovered by applying the reverse
calculation.
+For data columns the reverse calculation is:
+
+`table_field_id = (stats_struct_id - 10_000) / 200`
+
+For [reserved field IDs](#reserved-field-ids), the reverse calculation would
be:
+
+`table_field_id = stats_struct_id - 200 + (Integer.MAX_VALUE -
stats_struct_id) + (stats_struct_id - 2_147_000_000) / 200`
+
+using `num_reserved_field_ids = 200`,
`stats_space_field_id_start_for_metadata_fields = 2_147_000_000`, and
`num_supported_stats_per_column = 200` (see [ID assignment for stats
fields](#id-assignment-for-stats-fields)).
+
+The formula is defined as:
+`table_field_id = stats_struct_id - num_reserved_field_ids +
(Integer.MAX_VALUE - stats_struct_id) + (stats_struct_id -
stats_space_field_id_start_for_metadata_fields) /
num_supported_stats_per_column`
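The reverse calculations can be checked with a short Python sketch. The function names are hypothetical, `//` plays the role of integer division, and `INTEGER_MAX_VALUE` stands in for Java's `Integer.MAX_VALUE`. In the reserved-ID formula the `stats_struct_id` terms outside the division cancel, so it appears to reduce to `Integer.MAX_VALUE - num_reserved_field_ids + (stats_struct_id - stats_space_field_id_start_for_metadata_fields) / num_supported_stats_per_column`; the sketch includes both forms so they can be compared.

```python
INTEGER_MAX_VALUE = 2_147_483_647  # value of Java's Integer.MAX_VALUE
DATA_START = 10_000                # stats_space_field_id_start_for_data_fields
METADATA_START = 2_147_000_000     # stats_space_field_id_start_for_metadata_fields
NUM_STATS = 200                    # num_supported_stats_per_column
NUM_RESERVED = 200                 # num_reserved_field_ids


def data_table_field_id(stats_struct_id: int) -> int:
    """Recover a data column's table field ID from its stats struct ID."""
    return (stats_struct_id - DATA_START) // NUM_STATS


def metadata_table_field_id(stats_struct_id: int) -> int:
    """Reserved-ID reverse formula, transcribed term by term."""
    return (stats_struct_id - NUM_RESERVED
            + (INTEGER_MAX_VALUE - stats_struct_id)
            + (stats_struct_id - METADATA_START) // NUM_STATS)


def metadata_table_field_id_simplified(stats_struct_id: int) -> int:
    """Same result after cancelling the stats_struct_id terms."""
    return (INTEGER_MAX_VALUE - NUM_RESERVED
            + (stats_struct_id - METADATA_START) // NUM_STATS)


print(data_table_field_id(30_000))
print(metadata_table_field_id(2_147_018_800))
```

Because every listed offset is smaller than 200, the floor division also maps the ID of an individual statistic (for example `10_406`) back to its table field ID.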
+
+Below are examples for some table field ID -> stats struct id calculations.
+
+| Table Field ID | Stats ID of Stats struct |
+|---------------------|---------------------------|
+| 0 | 10_000 |
+| 1 | 10_200 |
+| 2 | 10_400 |
+| 5 | 11_000 |
+| 100 | 30_000 |
+| 1_000_000 | 200_010_000 |
+
+| Reserved Field ID | Stats ID of Stats struct |
+|---------------------|---------------------------|
+| 2_147_483_447 | 2_147_000_000 |
+| 2_147_483_448 | 2_147_000_200 |
+| 2_147_483_541 | 2_147_018_800 |
+| 2_147_483_645 | 2_147_039_600 |
+| 2_147_483_646 | 2_147_039_800 |
+
+The table below shows the stats IDs of individual field statistics, which are
calculated from the offsets described in the [Field stats types
section](#field-stats-types).
+
+| Table Field ID | Stats ID of Stats struct | Stats Type | Stats ID of individual statistic |
+|----------------|--------------------------|-------------------------|----------------------------------|
+| 2 | 10_400 | value_count | 10_401 |
+| | | null_value_count | 10_402 |
+| | | nan_value_count | 10_403 |
+| | | avg_value_size_in_bytes | 10_404 |
+| | | max_value_size_in_bytes | 10_405 |
+| | | lower_bound | 10_406 |
+| | | upper_bound | 10_407 |
+| | | exact_bounds | 10_408 |
+| 5 | 11_000 | value_count | 11_001 |
+| | | null_value_count | 11_002 |
+| | | nan_value_count | 11_003 |
+| | | avg_value_size_in_bytes | 11_004 |
+| | | max_value_size_in_bytes | 11_005 |
+| | | lower_bound | 11_006 |
+| | | upper_bound | 11_007 |
+| | | exact_bounds | 11_008 |
+
+###### Manifest schema and `content_stats` typing
+
+The `content_stats` type is dynamically derived from the table schema and
produces one nested stats struct per table column, which is then combined into
the root `content_stats` struct.
+That derived type is embedded in the manifest for `data_file` at the field
reserved for `content_stats` in v4.
+Writers and readers must use this **same** manifest schema both when writing
and when reading manifest files for the table.
+
+Using one schema for read and write is what allows **type promotion** on
stats: if an `int` column `x` is promoted to `long`, the nested stats struct
changes from `struct<..., lower_bound int, upper_bound int, ...>` to
`struct<..., lower_bound long, upper_bound long, ...>`.
+Reading an older manifest applies normal Iceberg type promotion to those bound
fields; writing after promotion then uses the promoted struct type, so
round-trips stay consistent.
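A toy model of how the bound types follow the column type is sketched below. It is only illustrative: type names are plain strings, both helpers are hypothetical, and `ALLOWED_PROMOTIONS` deliberately lists just two promotions as an assumption rather than the full set of Iceberg promotion rules.

```python
# Assumption: only a subset of type promotions is modeled here.
ALLOWED_PROMOTIONS = {("int", "long"), ("float", "double")}


def bound_field_types(column_type: str) -> dict:
    """Hypothetical helper: types of the two bound fields in a stats struct."""
    return {"lower_bound": column_type, "upper_bound": column_type}


def promoted_bound_field_types(old_type: str, new_type: str) -> dict:
    """Bound field types after the column is promoted to new_type."""
    if (old_type, new_type) not in ALLOWED_PROMOTIONS:
        raise ValueError(f"unsupported promotion: {old_type} -> {new_type}")
    # The stats struct is re-derived from the promoted column type.
    return bound_field_types(new_type)


before = bound_field_types("int")
after = promoted_bound_field_types("int", "long")
print(before, after)
```

The point of the sketch is that the stats struct is never edited directly; it is re-derived from the current table schema, so the bounds pick up the promoted type automatically.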
+
+###### Content stats aggregation
+TBD...
Review Comment:
I'm planning to add this but haven't had time yet.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.