Re: [PR] Spec: Add content stats to spec [iceberg]

via GitHub Tue, 12 May 2026 15:07:16 -0700


stevenzwu commented on code in PR #14234:
URL: https://github.com/apache/iceberg/pull/14234#discussion_r3230008022



##########
format/spec.md:
##########
@@ -704,11 +727,111 @@ Examples of valid field paths using normalized JSON path 
format are:
 * `$['tags']` -- the `tags` array
 * `$['addresses']['zip']` -- the `zip` field in an `addresses` array that 
contains objects
 
-For `geometry` and `geography` types, `lower_bounds` and `upper_bounds` are 
both points of the following coordinates X, Y, Z, and M (see Appendix G) which 
are the lower / upper bound of all objects in the file.
+##### Content Stats
 
-For `geography` only, xmin (X value of `lower_bounds`) may be greater than 
xmax (X value of `upper_bounds`), in which case an object in this bounding box 
may match if it contains an X such that x >= xmin OR x <= xmax. In geographic 
terminology, the concepts of xmin, xmax, ymin, and ymax are also known as 
westernmost, easternmost, southernmost and northernmost, respectively. These 
points are further restricted to the canonical ranges of [-180..180] for X and 
[-90..90] for Y.
+In Iceberg v4, statistics are stored in typed fields grouped in a struct that 
corresponds to the table field. These stats structs are nested within the 
`content_stats` struct in manifest files.
 
-When calculating upper and lower bounds for `geometry` and `geography`, null 
or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) 
contributes a value to X but no values to Y, Z, or M dimension bounds. If a 
dimension has only null or NaN values, that dimension is omitted from the 
bounding box. If either the X or Y dimension is missing then the bounding box 
itself is not produced.
+###### Field Statistics
+
+Field-level structs in `content_stats` are based on the corresponding table 
field's type, requirement, and ID (`field-id`).
+
+Field stats structs are assigned a range of 200 IDs, starting at `10_000 + 200 
* field-id`. The first ID in the range (`base-id`) is the ID of the struct 
field in `content_stats`. Fields within the stats struct are assigned IDs from 
the range by adding an offset to the `base-id`. For example, the stats struct 
for table field 2 uses IDs in the range `[10_400, 10_599]`, the field within 
`content_stats` has uses the `base-id`, ID `10_400`, and its `lower_bound` 
field (offset 1) uses ID `10_401`.
+
+Content stats must be resovled by ID; field names used for stats structs are 
informational. The recommended name for each field is the full name of the 
field in the table schema.

Review Comment:
   Typo: `resovled` should be `resolved`.
   
   ```suggestion
   Content stats must be resolved by ID; field names used for stats structs are 
informational. The recommended name for each field is the full name of the 
field in the table schema.
   ```



##########
format/spec.md:
##########
@@ -704,11 +727,111 @@ Examples of valid field paths using normalized JSON path 
format are:
 * `$['tags']` -- the `tags` array
 * `$['addresses']['zip']` -- the `zip` field in an `addresses` array that 
contains objects
 
-For `geometry` and `geography` types, `lower_bounds` and `upper_bounds` are 
both points of the following coordinates X, Y, Z, and M (see Appendix G) which 
are the lower / upper bound of all objects in the file.
+##### Content Stats
 
-For `geography` only, xmin (X value of `lower_bounds`) may be greater than 
xmax (X value of `upper_bounds`), in which case an object in this bounding box 
may match if it contains an X such that x >= xmin OR x <= xmax. In geographic 
terminology, the concepts of xmin, xmax, ymin, and ymax are also known as 
westernmost, easternmost, southernmost and northernmost, respectively. These 
points are further restricted to the canonical ranges of [-180..180] for X and 
[-90..90] for Y.
+In Iceberg v4, statistics are stored in typed fields grouped in a struct that 
corresponds to the table field. These stats structs are nested within the 
`content_stats` struct in manifest files.
 
-When calculating upper and lower bounds for `geometry` and `geography`, null 
or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) 
contributes a value to X but no values to Y, Z, or M dimension bounds. If a 
dimension has only null or NaN values, that dimension is omitted from the 
bounding box. If either the X or Y dimension is missing then the bounding box 
itself is not produced.
+###### Field Statistics
+
+Field-level structs in `content_stats` are based on the corresponding table 
field's type, requirement, and ID (`field-id`).
+
+Field stats structs are assigned a range of 200 IDs, starting at `10_000 + 200 
* field-id`. The first ID in the range (`base-id`) is the ID of the struct 
field in `content_stats`. Fields within the stats struct are assigned IDs from 
the range by adding an offset to the `base-id`. For example, the stats struct 
for table field 2 uses IDs in the range `[10_400, 10_599]`, the field within 
`content_stats` has uses the `base-id`, ID `10_400`, and its `lower_bound` 
field (offset 1) uses ID `10_401`.

Review Comment:
   Typo: `has uses` should be `uses`.
   
   ```suggestion
   Field stats structs are assigned a range of 200 IDs, starting at `10_000 + 
200 * field-id`. The first ID in the range (`base-id`) is the ID of the struct 
field in `content_stats`. Fields within the stats struct are assigned IDs from 
the range by adding an offset to the `base-id`. For example, the stats struct 
for table field 2 uses IDs in the range `[10_400, 10_599]`, the field within 
`content_stats` uses the `base-id`, ID `10_400`, and its `lower_bound` field 
(offset 1) uses ID `10_401`.
   ```
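
   For anyone checking the arithmetic in this paragraph, here is a minimal 
sketch of the ID assignment as the quoted text describes it. The function and 
constant names are made up for illustration and are not part of the spec or of 
any Iceberg API; the offsets are copied from the metrics table in the proposed 
spec text.

```python
# Hypothetical helper names; only the numbers come from the quoted spec text.
STATS_ID_START = 10_000  # start of the ID space for field stats structs
IDS_PER_FIELD = 200      # each table field gets a range of 200 IDs

# Offsets from the metrics table in the proposed spec text.
OFFSETS = {
    "lower_bound": 1,
    "upper_bound": 2,
    "tight_bounds": 3,
    "value_count": 4,
    "null_value_count": 5,
    "nan_value_count": 6,
    "avg_value_size_in_bytes": 7,
}


def base_id(field_id: int) -> int:
    """ID of the stats struct for a table field inside content_stats."""
    return STATS_ID_START + IDS_PER_FIELD * field_id


def id_range(field_id: int) -> tuple[int, int]:
    """Inclusive (first, last) IDs of the range assigned to a table field."""
    first = base_id(field_id)
    return first, first + IDS_PER_FIELD - 1


def metric_id(field_id: int, name: str) -> int:
    """ID of a metric field: the field's base-id plus the metric's offset."""
    return base_id(field_id) + OFFSETS[name]
```

   With `field_id = 2` this reproduces the numbers in the paragraph: the range 
is `[10_400, 10_599]` and `lower_bound` gets `10_401`.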



##########
format/spec.md:
##########
@@ -704,11 +727,111 @@ Examples of valid field paths using normalized JSON path 
format are:
 * `$['tags']` -- the `tags` array
 * `$['addresses']['zip']` -- the `zip` field in an `addresses` array that 
contains objects
 
-For `geometry` and `geography` types, `lower_bounds` and `upper_bounds` are 
both points of the following coordinates X, Y, Z, and M (see Appendix G) which 
are the lower / upper bound of all objects in the file.
+##### Content Stats
 
-For `geography` only, xmin (X value of `lower_bounds`) may be greater than 
xmax (X value of `upper_bounds`), in which case an object in this bounding box 
may match if it contains an X such that x >= xmin OR x <= xmax. In geographic 
terminology, the concepts of xmin, xmax, ymin, and ymax are also known as 
westernmost, easternmost, southernmost and northernmost, respectively. These 
points are further restricted to the canonical ranges of [-180..180] for X and 
[-90..90] for Y.
+In Iceberg v4, statistics are stored in typed fields grouped in a struct that 
corresponds to the table field. These stats structs are nested within the 
`content_stats` struct in manifest files.
 
-When calculating upper and lower bounds for `geometry` and `geography`, null 
or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) 
contributes a value to X but no values to Y, Z, or M dimension bounds. If a 
dimension has only null or NaN values, that dimension is omitted from the 
bounding box. If either the X or Y dimension is missing then the bounding box 
itself is not produced.
+###### Field Statistics
+
+Field-level structs in `content_stats` are based on the corresponding table 
field's type, requirement, and ID (`field-id`).
+
+Field stats structs are assigned a range of 200 IDs, starting at `10_000 + 200 
* field-id`. The first ID in the range (`base-id`) is the ID of the struct 
field in `content_stats`. Fields within the stats struct are assigned IDs from 
the range by adding an offset to the `base-id`. For example, the stats struct 
for table field 2 uses IDs in the range `[10_400, 10_599]`, the field within 
`content_stats` has uses the `base-id`, ID `10_400`, and its `lower_bound` 
field (offset 1) uses ID `10_401`.
+
+Content stats must be resovled by ID; field names used for stats structs are 
informational. The recommended name for each field is the full name of the 
field in the table schema.
+
+IDs in the range `10_000` (inclusive) to `200_000_000` (exclusive) are 
reserved for column stats structs in `content_stats`. Stats for table fields 
with IDs outside that range cannot be stored in `content_stats`.
+
+[Reserved metadata fields](#reserved-field-ids) must use the stats ID ranges 
from the following table. Stats for metadata fields not in the table are not 
tracked.
+
+| Reserved field                  | ID         | `base-id` | Range end |
+|---------------------------------|------------|-----------|-----------|
+| `_last_updated_sequence_number` | 2147483539 | 9000      | 9199 |
+| `_row_id`                       | 2147483540 | 9200      | 9399 |
+
+Each stats struct holds statistics for one table field. It may contain the 
following metrics:
+
+| Requirement | Offset | Name                      | Type                      
| Included for                                  | Description |
+|-------------|--------|---------------------------|---------------------------|-----------------------------------------------|-------------|
+| _optional_  | 1      | `lower_bound`             | Field type or `geo_lower` 
| all primitives or `variant`                   | Lower bound stored as the 
field's type, or `geo_lower` for geo types |
+| _optional_  | 2      | `upper_bound`             | Field type or `geo_upper` 
| all primitives or `variant`                   | Upper bound stored as the 
field's type, or `geo_upper` for geo types |
+| _optional_  | 3      | `tight_bounds`            | `boolean`                 
| all except `geometry`, `geography`, `variant` | When true, `lower_bound` and 
`upper_bound` must be equal to the min and max values |

Review Comment:
   The description states what `true` means, but is silent on the writer 
contract for naturally-exact types (int, long, date, timestamps, etc.). For 
those types, must writers set `tight_bounds=true` when bounds are exact, or is 
omission also valid? Pulling reader-side pruning rules out of the description 
was the right call, but a one-line note about writer obligations would close 
the loop. Otherwise readers cannot distinguish "writer didn't bother" from 
"bounds are not tight."
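
   To make the ambiguity concrete, here is a rough reader-side sketch. It 
assumes a stats struct has been decoded into a plain dict keyed by metric name, 
which is a hypothetical representation and not an Iceberg API. Because missing 
fields must be treated as unknown, a conservative reader can only rely on 
exactness when the flag is explicitly `true`:

```python
from typing import Optional


def bounds_are_exact(stats: dict) -> bool:
    """Return True only when the writer explicitly marked the bounds as tight.

    `stats` is a hypothetical decoded stats struct (metric name -> value).
    A missing `tight_bounds` is unknown, so it is handled like `false`.
    """
    tight: Optional[bool] = stats.get("tight_bounds")
    return tight is True


def can_answer_min_from_stats(stats: dict) -> bool:
    """A reader may use lower_bound as the exact min only if bounds are tight."""
    return bounds_are_exact(stats) and stats.get("lower_bound") is not None
```

   Under this reading, an `int` column written without `tight_bounds` loses the 
optimization even though its bounds are naturally exact, which is why the 
writer obligation matters.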



##########
format/spec.md:
##########
@@ -704,11 +727,111 @@ Examples of valid field paths using normalized JSON path 
format are:
 * `$['tags']` -- the `tags` array
 * `$['addresses']['zip']` -- the `zip` field in an `addresses` array that 
contains objects
 
-For `geometry` and `geography` types, `lower_bounds` and `upper_bounds` are 
both points of the following coordinates X, Y, Z, and M (see Appendix G) which 
are the lower / upper bound of all objects in the file.
+##### Content Stats
 
-For `geography` only, xmin (X value of `lower_bounds`) may be greater than 
xmax (X value of `upper_bounds`), in which case an object in this bounding box 
may match if it contains an X such that x >= xmin OR x <= xmax. In geographic 
terminology, the concepts of xmin, xmax, ymin, and ymax are also known as 
westernmost, easternmost, southernmost and northernmost, respectively. These 
points are further restricted to the canonical ranges of [-180..180] for X and 
[-90..90] for Y.
+In Iceberg v4, statistics are stored in typed fields grouped in a struct that 
corresponds to the table field. These stats structs are nested within the 
`content_stats` struct in manifest files.
 
-When calculating upper and lower bounds for `geometry` and `geography`, null 
or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) 
contributes a value to X but no values to Y, Z, or M dimension bounds. If a 
dimension has only null or NaN values, that dimension is omitted from the 
bounding box. If either the X or Y dimension is missing then the bounding box 
itself is not produced.
+###### Field Statistics
+
+Field-level structs in `content_stats` are based on the corresponding table 
field's type, requirement, and ID (`field-id`).
+
+Field stats structs are assigned a range of 200 IDs, starting at `10_000 + 200 
* field-id`. The first ID in the range (`base-id`) is the ID of the struct 
field in `content_stats`. Fields within the stats struct are assigned IDs from 
the range by adding an offset to the `base-id`. For example, the stats struct 
for table field 2 uses IDs in the range `[10_400, 10_599]`, the field within 
`content_stats` has uses the `base-id`, ID `10_400`, and its `lower_bound` 
field (offset 1) uses ID `10_401`.
+
+Content stats must be resovled by ID; field names used for stats structs are 
informational. The recommended name for each field is the full name of the 
field in the table schema.
+
+IDs in the range `10_000` (inclusive) to `200_000_000` (exclusive) are 
reserved for column stats structs in `content_stats`. Stats for table fields 
with IDs outside that range cannot be stored in `content_stats`.
+
+[Reserved metadata fields](#reserved-field-ids) must use the stats ID ranges 
from the following table. Stats for metadata fields not in the table are not 
tracked.
+
+| Reserved field                  | ID         | `base-id` | Range end |
+|---------------------------------|------------|-----------|-----------|
+| `_last_updated_sequence_number` | 2147483539 | 9000      | 9199 |
+| `_row_id`                       | 2147483540 | 9200      | 9399 |
+
+Each stats struct holds statistics for one table field. It may contain the 
following metrics:
+
+| Requirement | Offset | Name                      | Type                      
| Included for                                  | Description |
+|-------------|--------|---------------------------|---------------------------|-----------------------------------------------|-------------|
+| _optional_  | 1      | `lower_bound`             | Field type or `geo_lower` 
| all primitives or `variant`                   | Lower bound stored as the 
field's type, or `geo_lower` for geo types |
+| _optional_  | 2      | `upper_bound`             | Field type or `geo_upper` 
| all primitives or `variant`                   | Upper bound stored as the 
field's type, or `geo_upper` for geo types |
+| _optional_  | 3      | `tight_bounds`            | `boolean`                 
| all except `geometry`, `geography`, `variant` | When true, `lower_bound` and 
`upper_bound` must be equal to the min and max values |
+| _optional_  | 4      | `value_count`             | `long`                    
| all                                           | Number of values in the 
column (including null and NaN values) |
+| _optional_  | 5      | `null_value_count`        | `long`                    
| optional fields                               | Number of null values in the 
column |
+| _optional_  | 6      | `nan_value_count`         | `long`                    
| `float`, `double`                             | Number of NaN values in the 
column |
+| _optional_  | 7      | `avg_value_size_in_bytes` | `int`                     
| `string`, `binary`, `variant`                 | Avg value size (uncompressed) 
in bytes to estimate memory consumption |
+
+For example, stats for a `required` `int` field named `id` with field-id `2` 
are stored using:
+
+```
+10_400: optional struct id (default null) {
+  10_401: optional int lower_bound; // type matches the field type (int)
+  10_402: optional int upper_bound; // type matches the field type (int)
+  10_403: optional boolean tight_bounds;
+  10_404: optional long value_count;
+
+  // null_value_count is only used for optional fields
+  // nan_value_count is only used for float and double
+  // avg_value_size_in_bytes is only used for variable length types
+}
+```
+
+If any field is missing from the struct, readers must assume that it is 
unknown.
+
+Lower and upper bounds for `geometry` and `geography` columns are XYZM points 
that define a bounding box, stored in `geo_lower` and `geo_upper` structs (see 
[Bounds for Geometry and Geography](#bounds-for-geometry-and-geography). IDs 
used by geo structs are assigned using offsets in the table field's stats ID 
range.

Review Comment:
   Missing closing parenthesis before the period.
   
   ```suggestion
   Lower and upper bounds for `geometry` and `geography` columns are XYZM 
points that define a bounding box, stored in `geo_lower` and `geo_upper` 
structs (see [Bounds for Geometry and 
Geography](#bounds-for-geometry-and-geography)). IDs used by geo structs are 
assigned using offsets in the table field's stats ID range.
   ```



##########
format/spec.md:
##########
@@ -704,11 +727,111 @@ Examples of valid field paths using normalized JSON path 
format are:
 * `$['tags']` -- the `tags` array
 * `$['addresses']['zip']` -- the `zip` field in an `addresses` array that 
contains objects
 
-For `geometry` and `geography` types, `lower_bounds` and `upper_bounds` are 
both points of the following coordinates X, Y, Z, and M (see Appendix G) which 
are the lower / upper bound of all objects in the file.
+##### Content Stats
 
-For `geography` only, xmin (X value of `lower_bounds`) may be greater than 
xmax (X value of `upper_bounds`), in which case an object in this bounding box 
may match if it contains an X such that x >= xmin OR x <= xmax. In geographic 
terminology, the concepts of xmin, xmax, ymin, and ymax are also known as 
westernmost, easternmost, southernmost and northernmost, respectively. These 
points are further restricted to the canonical ranges of [-180..180] for X and 
[-90..90] for Y.
+In Iceberg v4, statistics are stored in typed fields grouped in a struct that 
corresponds to the table field. These stats structs are nested within the 
`content_stats` struct in manifest files.
 
-When calculating upper and lower bounds for `geometry` and `geography`, null 
or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) 
contributes a value to X but no values to Y, Z, or M dimension bounds. If a 
dimension has only null or NaN values, that dimension is omitted from the 
bounding box. If either the X or Y dimension is missing then the bounding box 
itself is not produced.
+###### Field Statistics
+
+Field-level structs in `content_stats` are based on the corresponding table 
field's type, requirement, and ID (`field-id`).
+
+Field stats structs are assigned a range of 200 IDs, starting at `10_000 + 200 
* field-id`. The first ID in the range (`base-id`) is the ID of the struct 
field in `content_stats`. Fields within the stats struct are assigned IDs from 
the range by adding an offset to the `base-id`. For example, the stats struct 
for table field 2 uses IDs in the range `[10_400, 10_599]`, the field within 
`content_stats` has uses the `base-id`, ID `10_400`, and its `lower_bound` 
field (offset 1) uses ID `10_401`.
+
+Content stats must be resovled by ID; field names used for stats structs are 
informational. The recommended name for each field is the full name of the 
field in the table schema.
+
+IDs in the range `10_000` (inclusive) to `200_000_000` (exclusive) are 
reserved for column stats structs in `content_stats`. Stats for table fields 
with IDs outside that range cannot be stored in `content_stats`.
+
+[Reserved metadata fields](#reserved-field-ids) must use the stats ID ranges 
from the following table. Stats for metadata fields not in the table are not 
tracked.
+
+| Reserved field                  | ID         | `base-id` | Range end |
+|---------------------------------|------------|-----------|-----------|
+| `_last_updated_sequence_number` | 2147483539 | 9000      | 9199 |
+| `_row_id`                       | 2147483540 | 9200      | 9399 |
+
+Each stats struct holds statistics for one table field. It may contain the 
following metrics:
+
+| Requirement | Offset | Name                      | Type                      
| Included for                                  | Description |
+|-------------|--------|---------------------------|---------------------------|-----------------------------------------------|-------------|
+| _optional_  | 1      | `lower_bound`             | Field type or `geo_lower` 
| all primitives or `variant`                   | Lower bound stored as the 
field's type, or `geo_lower` for geo types |
+| _optional_  | 2      | `upper_bound`             | Field type or `geo_upper` 
| all primitives or `variant`                   | Upper bound stored as the 
field's type, or `geo_upper` for geo types |
+| _optional_  | 3      | `tight_bounds`            | `boolean`                 
| all except `geometry`, `geography`, `variant` | When true, `lower_bound` and 
`upper_bound` must be equal to the min and max values |
+| _optional_  | 4      | `value_count`             | `long`                    
| all                                           | Number of values in the 
column (including null and NaN values) |
+| _optional_  | 5      | `null_value_count`        | `long`                    
| optional fields                               | Number of null values in the 
column |
+| _optional_  | 6      | `nan_value_count`         | `long`                    
| `float`, `double`                             | Number of NaN values in the 
column |
+| _optional_  | 7      | `avg_value_size_in_bytes` | `int`                     
| `string`, `binary`, `variant`                 | Avg value size (uncompressed) 
in bytes to estimate memory consumption |
+
+For example, stats for a `required` `int` field named `id` with field-id `2` 
are stored using:

Review Comment:
   It might be better to move the example toward the end of this section, and 
it would help to spell out the ID assignment for geo more explicitly.
   
   The int and string cases are straightforward, but the geo case has two 
non-obvious twists: `lower_bound`/`upper_bound` are themselves structs, and the 
inner `x`/`y`/`z`/`m` IDs are assigned from the parent's range (offsets 10–17), 
not from a fresh range inside `geo_lower`. A side-by-side example would make 
this concrete.
   
   The geo struct fields use offsets 10–17 from `base-id`, even though 
`geo_lower` is logically the value of `lower_bound` (offset 1). This flattened 
ID layout is unusual, and a worked example (e.g., a `geometry` field with 
`field-id = 2` producing IDs `10_400` for the stats struct, `10_401` for 
`lower_bound` aka `geo_lower`, and `10_410..10_413` for x/y/z/m) would remove 
ambiguity.
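
   As a starting point, the flattened layout could be sketched like this. The 
helper name and the dict shape are hypothetical. The `geo_lower` offsets 10–13 
follow the proposed table; placing `geo_upper` at offsets 14–17 is my 
assumption, inferred from the overall 10–17 span.

```python
# Hypothetical sketch of the flattened geo ID layout; not an Iceberg API.
STATS_ID_START = 10_000
IDS_PER_FIELD = 200

GEO_LOWER_OFFSETS = {"x": 10, "y": 11, "z": 12, "m": 13}
# Assumption: geo_upper takes the remaining offsets in the 10-17 span.
GEO_UPPER_OFFSETS = {"x": 14, "y": 15, "z": 16, "m": 17}


def geo_stats_ids(field_id: int) -> dict:
    """Return the IDs used by the stats struct of a geometry/geography field."""
    base = STATS_ID_START + IDS_PER_FIELD * field_id
    return {
        "struct": base,           # the stats struct itself (base-id)
        "lower_bound": base + 1,  # a geo_lower struct, not a primitive
        "upper_bound": base + 2,  # a geo_upper struct, not a primitive
        # Inner coordinate IDs come from the parent's range, not a new range.
        "geo_lower": {k: base + off for k, off in GEO_LOWER_OFFSETS.items()},
        "geo_upper": {k: base + off for k, off in GEO_UPPER_OFFSETS.items()},
    }
```

   For `field-id = 2` this gives `10_400` for the struct, `10_401` for 
`lower_bound`, and `10_410..10_413` for the `geo_lower` coordinates.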



##########
format/spec.md:
##########
@@ -704,11 +727,111 @@ Examples of valid field paths using normalized JSON path 
format are:
 * `$['tags']` -- the `tags` array
 * `$['addresses']['zip']` -- the `zip` field in an `addresses` array that 
contains objects
 
-For `geometry` and `geography` types, `lower_bounds` and `upper_bounds` are 
both points of the following coordinates X, Y, Z, and M (see Appendix G) which 
are the lower / upper bound of all objects in the file.
+##### Content Stats
 
-For `geography` only, xmin (X value of `lower_bounds`) may be greater than 
xmax (X value of `upper_bounds`), in which case an object in this bounding box 
may match if it contains an X such that x >= xmin OR x <= xmax. In geographic 
terminology, the concepts of xmin, xmax, ymin, and ymax are also known as 
westernmost, easternmost, southernmost and northernmost, respectively. These 
points are further restricted to the canonical ranges of [-180..180] for X and 
[-90..90] for Y.
+In Iceberg v4, statistics are stored in typed fields grouped in a struct that 
corresponds to the table field. These stats structs are nested within the 
`content_stats` struct in manifest files.
 
-When calculating upper and lower bounds for `geometry` and `geography`, null 
or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) 
contributes a value to X but no values to Y, Z, or M dimension bounds. If a 
dimension has only null or NaN values, that dimension is omitted from the 
bounding box. If either the X or Y dimension is missing then the bounding box 
itself is not produced.
+###### Field Statistics
+
+Field-level structs in `content_stats` are based on the corresponding table 
field's type, requirement, and ID (`field-id`).
+
+Field stats structs are assigned a range of 200 IDs, starting at `10_000 + 200 
* field-id`. The first ID in the range (`base-id`) is the ID of the struct 
field in `content_stats`. Fields within the stats struct are assigned IDs from 
the range by adding an offset to the `base-id`. For example, the stats struct 
for table field 2 uses IDs in the range `[10_400, 10_599]`, the field within 
`content_stats` has uses the `base-id`, ID `10_400`, and its `lower_bound` 
field (offset 1) uses ID `10_401`.
+
+Content stats must be resovled by ID; field names used for stats structs are 
informational. The recommended name for each field is the full name of the 
field in the table schema.
+
+IDs in the range `10_000` (inclusive) to `200_000_000` (exclusive) are 
reserved for column stats structs in `content_stats`. Stats for table fields 
with IDs outside that range cannot be stored in `content_stats`.
+
+[Reserved metadata fields](#reserved-field-ids) must use the stats ID ranges 
from the following table. Stats for metadata fields not in the table are not 
tracked.
+
+| Reserved field                  | ID         | `base-id` | Range end |
+|---------------------------------|------------|-----------|-----------|
+| `_last_updated_sequence_number` | 2147483539 | 9000      | 9199 |
+| `_row_id`                       | 2147483540 | 9200      | 9399 |
+
+Each stats struct holds statistics for one table field. It may contain the 
following metrics:
+
+| Requirement | Offset | Name                      | Type                      
| Included for                                  | Description |
+|-------------|--------|---------------------------|---------------------------|-----------------------------------------------|-------------|
+| _optional_  | 1      | `lower_bound`             | Field type or `geo_lower` 
| all primitives or `variant`                   | Lower bound stored as the 
field's type, or `geo_lower` for geo types |
+| _optional_  | 2      | `upper_bound`             | Field type or `geo_upper` 
| all primitives or `variant`                   | Upper bound stored as the 
field's type, or `geo_upper` for geo types |
+| _optional_  | 3      | `tight_bounds`            | `boolean`                 
| all except `geometry`, `geography`, `variant` | When true, `lower_bound` and 
`upper_bound` must be equal to the min and max values |
+| _optional_  | 4      | `value_count`             | `long`                    
| all                                           | Number of values in the 
column (including null and NaN values) |
+| _optional_  | 5      | `null_value_count`        | `long`                    
| optional fields                               | Number of null values in the 
column |
+| _optional_  | 6      | `nan_value_count`         | `long`                    
| `float`, `double`                             | Number of NaN values in the 
column |
+| _optional_  | 7      | `avg_value_size_in_bytes` | `int`                     
| `string`, `binary`, `variant`                 | Avg value size (uncompressed) 
in bytes to estimate memory consumption |
+
+For example, stats for a `required` `int` field named `id` with field-id `2` 
are stored using:
+
+```
+10_400: optional struct id (default null) {
+  10_401: optional int lower_bound; // type matches the field type (int)
+  10_402: optional int upper_bound; // type matches the field type (int)
+  10_403: optional boolean tight_bounds;
+  10_404: optional long value_count;
+
+  // null_value_count is only used for optional fields
+  // nan_value_count is only used for float and double
+  // avg_value_size_in_bytes is only used for variable length types
+}
+```
+
+If any field is missing from the struct, readers must assume that it is 
unknown.
+
+Lower and upper bounds for `geometry` and `geography` columns are XYZM points 
that define a bounding box, stored in `geo_lower` and `geo_upper` structs (see 
[Bounds for Geometry and Geography](#bounds-for-geometry-and-geography). IDs 
used by geo structs are assigned using offsets in the table field's stats ID 
range.
+
+The `geo_lower` struct is defined as:
+
+| Requirement | Offset | Name | Type     | Description |
+|-------------|--------|------|----------|-------------|
+| _required_  | 10     | `x`  | `double` | Bounding box westernmost/xmin; 
[-180..180] |

Review Comment:
   Why do `geo_lower` offsets start at 10 and not 8? 
   
   Two options worth deciding here:
   1. Compact geo offsets to 8–11 / 12–15 to eliminate the gap.
   2. Keep the gap and add a note that offsets 8–9 are reserved for future 
common metrics — so readers do not wonder, and so adding 1 or 2 new metrics 
does not need to renumber.
   
   Either is fine, but the current state (silent gap) may leave future readers 
wondering.



##########
format/spec.md:
##########
@@ -654,6 +654,7 @@ The `data_file` struct consists of the following fields:
     | _required_ | _required_ | _required_ | **`102  partition`**              
| `struct<...>`                                                               | 
Partition data tuple, schema based on the partition spec output using partition 
field ids for the struct field ids |
     | _required_ | _required_ | _required_ | **`103  record_count`**           
| `long`                                                                      | 
Number of records in this file, or the cardinality of a deletion vector |
     | _required_ | _required_ | _required_ | **`104  file_size_in_bytes`**     
| `long`                                                                      | 
Total file size in bytes |
+    |            |            |            | **`146  content_stats`**          
| `content_stats` `struct`                                                    | 
Container struct for per-field metrics structs. See [Content 
Stats](#content-stats) |

Review Comment:
   Shouldn't this be added to the v4 table? I remember @nastra mentioned it 
should be added as part of the larger v4 metadata PR #16025 from Amogh. Or are 
we putting it here first and moving it when the v4 table is added later?



##########
format/spec.md:
##########
@@ -704,11 +727,111 @@ Examples of valid field paths using normalized JSON path 
format are:
 * `$['tags']` -- the `tags` array
 * `$['addresses']['zip']` -- the `zip` field in an `addresses` array that 
contains objects
 
-For `geometry` and `geography` types, `lower_bounds` and `upper_bounds` are 
both points of the following coordinates X, Y, Z, and M (see Appendix G) which 
are the lower / upper bound of all objects in the file.
+##### Content Stats
 
-For `geography` only, xmin (X value of `lower_bounds`) may be greater than 
xmax (X value of `upper_bounds`), in which case an object in this bounding box 
may match if it contains an X such that x >= xmin OR x <= xmax. In geographic 
terminology, the concepts of xmin, xmax, ymin, and ymax are also known as 
westernmost, easternmost, southernmost and northernmost, respectively. These 
points are further restricted to the canonical ranges of [-180..180] for X and 
[-90..90] for Y.
+In Iceberg v4, statistics are stored in typed fields grouped in a struct that 
corresponds to the table field. These stats structs are nested within the 
`content_stats` struct in manifest files.
 
-When calculating upper and lower bounds for `geometry` and `geography`, null 
or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) 
contributes a value to X but no values to Y, Z, or M dimension bounds. If a 
dimension has only null or NaN values, that dimension is omitted from the 
bounding box. If either the X or Y dimension is missing then the bounding box 
itself is not produced.
+###### Field Statistics
+
+Field-level structs in `content_stats` are based on the corresponding table 
field's type, requirement, and ID (`field-id`).
+
+Field stats structs are assigned a range of 200 IDs, starting at `10_000 + 200 
* field-id`. The first ID in the range (`base-id`) is the ID of the struct 
field in `content_stats`. Fields within the stats struct are assigned IDs from 
the range by adding an offset to the `base-id`. For example, the stats struct 
for table field 2 uses IDs in the range `[10_400, 10_599]`, the field within 
`content_stats` uses the `base-id`, ID `10_400`, and its `lower_bound` 
field (offset 1) uses ID `10_401`.
+
+Content stats must be resolved by ID; field names used for stats structs are 
informational. The recommended name for each field is the full name of the 
field in the table schema.
+
+IDs in the range `10_000` (inclusive) to `200_000_000` (exclusive) are 
reserved for column stats structs in `content_stats`. Stats for table fields 
with IDs outside that range cannot be stored in `content_stats`.
+
+[Reserved metadata fields](#reserved-field-ids) must use the stats ID ranges 
from the following table. Stats for metadata fields not in the table are not 
tracked.
+
+| Reserved field                  | ID         | `base-id` | Range end |
+|---------------------------------|------------|-----------|-----------|
+| `_last_updated_sequence_number` | 2147483539 | 9000      | 9199 |
+| `_row_id`                       | 2147483540 | 9200      | 9399 |
+
+Each stats struct holds statistics for one table field. It may contain the 
following metrics:
+
+| Requirement | Offset | Name                      | Type                      
| Included for                                  | Description |
+|-------------|--------|---------------------------|---------------------------|-----------------------------------------------|-------------|
+| _optional_  | 1      | `lower_bound`             | Field type or `geo_lower` 
| all primitives or `variant`                   | Lower bound stored as the 
field's type, or `geo_lower` for geo types |
+| _optional_  | 2      | `upper_bound`             | Field type or `geo_upper` 
| all primitives or `variant`                   | Upper bound stored as the 
field's type, or `geo_upper` for geo types |
+| _optional_  | 3      | `tight_bounds`            | `boolean`                 
| all except `geometry`, `geography`, `variant` | When true, `lower_bound` and 
`upper_bound` must be equal to the min and max values |
+| _optional_  | 4      | `value_count`             | `long`                    
| all                                           | Number of values in the 
column (including null and NaN values) |
+| _optional_  | 5      | `null_value_count`        | `long`                    
| optional fields                               | Number of null values in the 
column |
+| _optional_  | 6      | `nan_value_count`         | `long`                    
| `float`, `double`                             | Number of NaN values in the 
column |
+| _optional_  | 7      | `avg_value_size_in_bytes` | `int`                     
| `string`, `binary`, `variant`                 | Avg value size (uncompressed) 
in bytes to estimate memory consumption |
+
+For example, stats for a `required` `int` field named `id` with field-id `2` 
are stored using:
+
+```
+10_400: optional struct id (default null) {
+  10_401: optional int lower_bound; // type matches the field type (int)
+  10_402: optional int upper_bound; // type matches the field type (int)
+  10_403: optional boolean tight_bounds;
+  10_404: optional long value_count;
+
+  // null_value_count is only used for optional fields
+  // nan_value_count is only used for float and double
+  // avg_value_size_in_bytes is only used for variable length types
+}
+```
+
+If any field is missing from the struct, readers must assume that it is 
unknown.
+
+Lower and upper bounds for `geometry` and `geography` columns are XYZM points 
that define a bounding box, stored in `geo_lower` and `geo_upper` structs (see 
[Bounds for Geometry and Geography](#bounds-for-geometry-and-geography)). IDs 
used by geo structs are assigned using offsets in the table field's stats ID 
range.
+
+The `geo_lower` struct is defined as:
+
+| Requirement | Offset | Name | Type     | Description |
+|-------------|--------|------|----------|-------------|
+| _required_  | 10     | `x`  | `double` | Bounding box westernmost/xmin; 
[-180..180] |
+| _required_  | 11     | `y`  | `double` | Bounding box southernmost/ymin; 
[-90..90] |
+| _optional_  | 12     | `z`  | `double` | Bounding box zmin |
+| _optional_  | 13     | `m`  | `double` | Bounding box mmin |
+
+The `geo_upper` struct is defined as:
+
+| Requirement | Offset | Name | Type     | Description |
+|-------------|--------|------|----------|-------------|
+| _required_  | 14     | `x`  | `double` | Bounding box eastermost/xmax; 
[-180..180] |

Review Comment:
   Typo: `eastermost` should be `easternmost`. Line 706 above already uses 
`easternmost`, so this is also an internal inconsistency.
   
   ```suggestion
   | _required_  | 14     | `x`  | `double` | Bounding box easternmost/xmax; 
[-180..180] |
   ```



##########
format/spec.md:
##########
@@ -704,11 +727,111 @@ Examples of valid field paths using normalized JSON path 
format are:
 * `$['tags']` -- the `tags` array
 * `$['addresses']['zip']` -- the `zip` field in an `addresses` array that 
contains objects
 
-For `geometry` and `geography` types, `lower_bounds` and `upper_bounds` are 
both points of the following coordinates X, Y, Z, and M (see Appendix G) which 
are the lower / upper bound of all objects in the file.
+##### Content Stats
 
-For `geography` only, xmin (X value of `lower_bounds`) may be greater than 
xmax (X value of `upper_bounds`), in which case an object in this bounding box 
may match if it contains an X such that x >= xmin OR x <= xmax. In geographic 
terminology, the concepts of xmin, xmax, ymin, and ymax are also known as 
westernmost, easternmost, southernmost and northernmost, respectively. These 
points are further restricted to the canonical ranges of [-180..180] for X and 
[-90..90] for Y.
+In Iceberg v4, statistics are stored in typed fields grouped in a struct that 
corresponds to the table field. These stats structs are nested within the 
`content_stats` struct in manifest files.
 
-When calculating upper and lower bounds for `geometry` and `geography`, null 
or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) 
contributes a value to X but no values to Y, Z, or M dimension bounds. If a 
dimension has only null or NaN values, that dimension is omitted from the 
bounding box. If either the X or Y dimension is missing then the bounding box 
itself is not produced.
+###### Field Statistics
+
+Field-level structs in `content_stats` are based on the corresponding table 
field's type, requirement, and ID (`field-id`).
+
+Field stats structs are assigned a range of 200 IDs, starting at `10_000 + 200 
* field-id`. The first ID in the range (`base-id`) is the ID of the struct 
field in `content_stats`. Fields within the stats struct are assigned IDs from 
the range by adding an offset to the `base-id`. For example, the stats struct 
for table field 2 uses IDs in the range `[10_400, 10_599]`, the field within 
`content_stats` uses the `base-id`, ID `10_400`, and its `lower_bound` 
field (offset 1) uses ID `10_401`.
+
+Content stats must be resolved by ID; field names used for stats structs are 
informational. The recommended name for each field is the full name of the 
field in the table schema.
+
+IDs in the range `10_000` (inclusive) to `200_000_000` (exclusive) are 
reserved for column stats structs in `content_stats`. Stats for table fields 
with IDs outside that range cannot be stored in `content_stats`.
+
+[Reserved metadata fields](#reserved-field-ids) must use the stats ID ranges 
from the following table. Stats for metadata fields not in the table are not 
tracked.
+
+| Reserved field                  | ID         | `base-id` | Range end |
+|---------------------------------|------------|-----------|-----------|
+| `_last_updated_sequence_number` | 2147483539 | 9000      | 9199 |
+| `_row_id`                       | 2147483540 | 9200      | 9399 |
+
+Each stats struct holds statistics for one table field. It may contain the 
following metrics:
+
+| Requirement | Offset | Name                      | Type                      
| Included for                                  | Description |
+|-------------|--------|---------------------------|---------------------------|-----------------------------------------------|-------------|
+| _optional_  | 1      | `lower_bound`             | Field type or `geo_lower` 
| all primitives or `variant`                   | Lower bound stored as the 
field's type, or `geo_lower` for geo types |
+| _optional_  | 2      | `upper_bound`             | Field type or `geo_upper` 
| all primitives or `variant`                   | Upper bound stored as the 
field's type, or `geo_upper` for geo types |
+| _optional_  | 3      | `tight_bounds`            | `boolean`                 
| all except `geometry`, `geography`, `variant` | When true, `lower_bound` and 
`upper_bound` must be equal to the min and max values |
+| _optional_  | 4      | `value_count`             | `long`                    
| all                                           | Number of values in the 
column (including null and NaN values) |
+| _optional_  | 5      | `null_value_count`        | `long`                    
| optional fields                               | Number of null values in the 
column |
+| _optional_  | 6      | `nan_value_count`         | `long`                    
| `float`, `double`                             | Number of NaN values in the 
column |
+| _optional_  | 7      | `avg_value_size_in_bytes` | `int`                     
| `string`, `binary`, `variant`                 | Avg value size (uncompressed) 
in bytes to estimate memory consumption |
+
+For example, stats for a `required` `int` field named `id` with field-id `2` 
are stored using:
+
+```
+10_400: optional struct id (default null) {
+  10_401: optional int lower_bound; // type matches the field type (int)
+  10_402: optional int upper_bound; // type matches the field type (int)
+  10_403: optional boolean tight_bounds;
+  10_404: optional long value_count;
+
+  // null_value_count is only used for optional fields
+  // nan_value_count is only used for float and double
+  // avg_value_size_in_bytes is only used for variable length types
+}
+```
+
+If any field is missing from the struct, readers must assume that it is 
unknown.
+
+Lower and upper bounds for `geometry` and `geography` columns are XYZM points 
that define a bounding box, stored in `geo_lower` and `geo_upper` structs (see 
[Bounds for Geometry and Geography](#bounds-for-geometry-and-geography)). IDs 
used by geo structs are assigned using offsets in the table field's stats ID 
range.
+
+The `geo_lower` struct is defined as:
+
+| Requirement | Offset | Name | Type     | Description |
+|-------------|--------|------|----------|-------------|
+| _required_  | 10     | `x`  | `double` | Bounding box westernmost/xmin; 
[-180..180] |
+| _required_  | 11     | `y`  | `double` | Bounding box southernmost/ymin; 
[-90..90] |
+| _optional_  | 12     | `z`  | `double` | Bounding box zmin |
+| _optional_  | 13     | `m`  | `double` | Bounding box mmin |
+
+The `geo_upper` struct is defined as:
+
+| Requirement | Offset | Name | Type     | Description |
+|-------------|--------|------|----------|-------------|
+| _required_  | 14     | `x`  | `double` | Bounding box eastermost/xmax; 
[-180..180] |
+| _required_  | 15     | `y`  | `double` | Bounding box northernmost/ymax; 
[-90..90] |
+| _optional_  | 16     | `z`  | `double` | Bounding box zmax |
+| _optional_  | 17     | `m`  | `double` | Bounding box mmax |
+
+For `variant`, both bounds are unshredded `variant` that store variant field 
bounds by normalized JSON paths as field names. See [Bounds for 
Variant](#bounds-for-variant) for details on producing these bounds.
+
+###### Content Stats in Manifests
+
+Manifest files are written using a specific `content_stats` struct type, 
determined by the writer and incorporated into the manifest schema. All 
field-level structs are optional fields in the `content_stats` struct.
+
+For example, stats for a table with a required int, `id`, and an optional 
string, `data`, are stored as:
+
+```
+146: optional struct content_stats {
+  // stats struct for table field 2: required int id
+  10_400: optional struct id (default null) {
+    10_401: optional int lower_bound;
+    10_402: optional int upper_bound;
+    10_403: optional boolean tight_bounds;
+    10_404: optional long value_count;
+  }
+
+  // stats struct for table field 3: optional string data
+  10_600: optional struct data (default null) {
+    10_601: optional string lower_bound;
+    10_602: optional string upper_bound;
+    10_603: optional boolean tight_bounds;
+    10_604: optional long value_count;
+    10_605: optional long null_value_count;
+    10_607: optional int avg_value_size_in_bytes;
+  }
+}
+```
+
+Implementations may produce stats structs for fields that are not in the table 
schema, if a field ID from the table's column ID space is assigned for the data 
values (by allocating an ID using `last-column-id`). Implementations are not 
required to write a stats struct for every table field.
+
+Fields with stats tracked in `content_stats` change based on updates like 
schema evolution or metrics configuration. Writers adapt to table changes by 
writing new manifest files with the implementation's current `content_stats` 
type. When existing file metadata is written to new manifests, writers must 
discard old stats, set unknown stats structs to null, and promote lower and 
upper bounds types to conform to the manifest schema.

Review Comment:
   This currently covers only type promotion. The earlier suggestion to spell 
out add/drop/rename column was acknowledged on 2026-04-24 ("good point, I'll 
add those to this section") but the landed text only mentions "updates like 
schema evolution" generically. Worth stating each case explicitly:
   
   - **Add column**: a new stats struct field appears in newer manifests; older 
manifests without the field resolve via the v4 field default of `null`.
   - **Drop column**: when rewriting manifests with the current schema, the 
dropped column's stats struct is omitted; older manifests retain it and readers 
ignore it.
   - **Rename column**: ID-based resolution makes rename invisible on the wire; 
only the recommended name (the informational struct field name from the 
resolution rule above) changes when manifests are rewritten with the current 
schema.
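   
   A minimal sketch of how those three cases fall out of ID-based resolution, 
assuming a manifest's `content_stats` is modeled as a plain dict keyed by stats 
struct ID. The helper names `stats_base_id` and `resolve_stats` and the dict 
layout are hypothetical and only for illustration; they are not part of the 
spec or of any Iceberg API. Only the `10_000 + 200 * field-id` rule comes from 
the text under review.

```python
from typing import Dict, Optional

STATS_ID_START = 10_000  # first ID reserved for column stats structs (from the spec text)
STATS_ID_RANGE = 200     # number of IDs assigned to each stats struct (from the spec text)


def stats_base_id(field_id: int) -> int:
    """Return the base-id of the stats struct for a table field."""
    return STATS_ID_START + STATS_ID_RANGE * field_id


def resolve_stats(
    manifest_stats: Dict[int, dict],
    current_schema: Dict[int, str],
) -> Dict[str, Optional[dict]]:
    """Resolve stats for the current schema by ID, ignoring stored names.

    manifest_stats maps stats struct ID -> {"name": ..., "stats": {...}}
    (an assumed in-memory layout). current_schema maps field-id -> name.
    """
    resolved: Dict[str, Optional[dict]] = {}
    for field_id, name in current_schema.items():
        entry = manifest_stats.get(stats_base_id(field_id))
        # A struct missing from an older manifest resolves to null (None here).
        resolved[name] = entry["stats"] if entry is not None else None
    return resolved


# Older manifest written when the schema was {2: id, 3: data, 4: tmp}.
old_manifest = {
    10_400: {"name": "id", "stats": {"value_count": 10}},
    10_600: {"name": "data", "stats": {"value_count": 10, "null_value_count": 1}},
    10_800: {"name": "tmp", "stats": {"value_count": 10}},
}

# Current schema: field 3 renamed to "payload", field 4 dropped, field 5 added.
current_schema = {2: "id", 3: "payload", 5: "ts"}

resolved = resolve_stats(old_manifest, current_schema)
```

   In this sketch the renamed field still finds its stats under ID `10_600`, 
the dropped field's struct at `10_800` is never looked up, and the added field 
resolves to `None` because the older manifest has no struct at `11_000`.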



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
