emkornfield commented on code in PR #14234:
URL: https://github.com/apache/iceberg/pull/14234#discussion_r3268144181
##########
format/spec.md:
##########
@@ -704,11 +727,133 @@ Examples of valid field paths using normalized JSON path
format are:
* `$['tags']` -- the `tags` array
* `$['addresses']['zip']` -- the `zip` field in an `addresses` array that
contains objects
-For `geometry` and `geography` types, `lower_bounds` and `upper_bounds` are
both points of the following coordinates X, Y, Z, and M (see Appendix G) which
are the lower / upper bound of all objects in the file.
+##### Content Stats
-For `geography` only, xmin (X value of `lower_bounds`) may be greater than
xmax (X value of `upper_bounds`), in which case an object in this bounding box
may match if it contains an X such that x >= xmin OR x <= xmax. In geographic
terminology, the concepts of xmin, xmax, ymin, and ymax are also known as
westernmost, easternmost, southernmost and northernmost, respectively. These
points are further restricted to the canonical ranges of [-180..180] for X and
[-90..90] for Y.
+In Iceberg v4, statistics are stored in typed fields grouped in a struct that
corresponds to the table field. These stats structs are nested within the
`content_stats` struct in manifest files.
-When calculating upper and lower bounds for `geometry` and `geography`, null
or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN)
contributes a value to X but no values to Y, Z, or M dimension bounds. If a
dimension has only null or NaN values, that dimension is omitted from the
bounding box. If either the X or Y dimension is missing then the bounding box
itself is not produced.
+###### Field Statistics
+
+Field-level structs in `content_stats` are based on the corresponding table
field's type, requirement, and ID (`field-id`).
+
+Field stats structs are assigned a range of 200 IDs, starting at `10_000 + 200
* field-id`. The first ID in the range (`base-id`) is the ID of the struct
field in `content_stats`. Fields within the stats struct are assigned IDs from
the range by adding an offset to the `base-id`. For example, the stats struct
for table field 2 uses IDs in the range `[10_400, 10_599]`, the field within
`content_stats` uses the `base-id`, ID `10_400`, and its `lower_bound` field
(offset 1) uses ID `10_401`.
+
+Content stats must be resolved by ID; field names used for stats structs are
informational. The recommended name for each field is the full name of the
field in the table schema.
+
+IDs in the range `10_000` (inclusive) to `200_000_000` (exclusive) are
reserved for column stats structs in `content_stats`. Stats for table fields
with stats IDs outside that range cannot be stored in `content_stats`.
+
+[Reserved metadata fields](#reserved-field-ids) must use the stats ID ranges
from the following table. Stats for metadata fields not in the table are not
tracked.
+
+| Reserved field | ID | `base-id` | Range end |
+|---------------------------------|------------|-----------|-----------|
+| `_last_updated_sequence_number` | 2147483539 | 9000 | 9199 |
+| `_row_id` | 2147483540 | 9200 | 9399 |
+
+Each stats struct holds statistics for one table field. It may contain the
following metrics:
+
+| Requirement | Offset | Name | Type | Included for | Description |
+|-------------|--------|---------------------------|---------------------------|-----------------------------------------------|-------------|
+| _optional_ | 1 | `lower_bound` | Field type or `geo_lower` | all primitives or `variant` | Lower bound stored as the field's type, or `geo_lower` for geo types |
+| _optional_ | 2 | `upper_bound` | Field type or `geo_upper` | all primitives or `variant` | Upper bound stored as the field's type, or `geo_upper` for geo types |
+| _optional_ | 3 | `tight_bounds` | `boolean` | all except `geometry`, `geography`, `variant` | When true, `lower_bound` and `upper_bound` must be equal to the min and max values |
+| _optional_ | 4 | `value_count` | `long` | all | Number of values in the column (including null and NaN values) |
+| _optional_ | 5 | `null_value_count` | `long` | optional fields | Number of null values in the column |
+| _optional_ | 6 | `nan_value_count` | `long` | `float`, `double` | Number of NaN values in the column |
+| _optional_ | 7 | `avg_value_size_in_bytes` | `int` | `string`, `binary`, `variant` | Avg value size (uncompressed) in bytes to estimate memory consumption |
Review Comment:
>> When this is propagated to engines is this value aggregated (do they see
a number per file or a final average for the column)?
> I don't think that this is relevant to the spec. What would this change?
@rdblue I was not asking about the storage of the aggregation explicitly,
but more functionally about how we intend to use this data in general.
My mental model is that we will be doing scan planning, which includes
potential predicate and column pushdown, and that this effectively results in
a list of objects. Something like:
[ {"file1", col_x: `avg_length_in_bytes=1024`,
col_y: `avg_length_in_bytes=2048`}, {"file2", col_x: `avg_length_in_bytes=1025`,
col_y: `avg_length_in_bytes=2049`}]
If we are ultimately trying to get to a single `avg_length_in_bytes`, then we
need to re-weight each value every time we want to aggregate these values. This
adds extra operations on the write side (we need to compute the average from
the total bytes and number of values written to Parquet) as well as extra
operations on the read side.
If this mental model is correct, I think it becomes relevant to the
specification because we have, IMO, an opportunity to simplify by keeping
`total_length_in_bytes` in metadata and deriving the average on read without
extra operations.
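
To make the difference concrete, here is a rough Python sketch of the two aggregation paths. Everything in it is hypothetical and only for illustration: `FileColumnStats`, its field names, and the sample numbers are assumptions, not Iceberg APIs or anything defined in the spec.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class FileColumnStats:
    """Hypothetical per-file stats for one column (illustrative names only)."""
    value_count: int
    avg_length_in_bytes: int    # what a stored per-file average would look like
    total_length_in_bytes: int  # the alternative: store the total instead


def combined_avg_from_averages(stats: Sequence[FileColumnStats]) -> Optional[float]:
    """Combine per-file averages; each one must be re-weighted by value_count."""
    total_values = sum(s.value_count for s in stats)
    if total_values == 0:
        return None
    weighted_bytes = sum(s.avg_length_in_bytes * s.value_count for s in stats)
    return weighted_bytes / total_values


def combined_avg_from_totals(stats: Sequence[FileColumnStats]) -> Optional[float]:
    """Combine per-file totals; totals simply add, with one division at the end."""
    total_values = sum(s.value_count for s in stats)
    if total_values == 0:
        return None
    return sum(s.total_length_in_bytes for s in stats) / total_values


# Made-up sample: two files with different value counts.
files = [
    FileColumnStats(value_count=10, avg_length_in_bytes=1024, total_length_in_bytes=10_240),
    FileColumnStats(value_count=30, avg_length_in_bytes=2048, total_length_in_bytes=61_440),
]

# An unweighted mean of the per-file averages ignores the value counts.
naive_avg = sum(f.avg_length_in_bytes for f in files) / len(files)
```

With these sample numbers both functions agree on the column-wide average, while `naive_avg` does not, which is why the per-file averages cannot be combined without re-weighting. The sketch also assumes the stored average is exact; a real `int` average would be rounded at write time, so the re-weighted result would only approximate the one computed from totals.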
> @emkornfield do you have any preference of keeping total_size_in_bytes for
easier aggregation but deriving avg every time when using, instead of keeping
avg_size_in_bytes to avoid the need of deriving it on usage but making
aggregation slightly more complicated?
@gaborkaszab I think this is my question. On the surface, it doesn't seem
like avg_size_in_bytes really saves any derivation for the intended use cases.
If it does, great, but it seems like we might just be adding unnecessary
transforms.