rdblue commented on code in PR #12658:
URL: https://github.com/apache/iceberg/pull/12658#discussion_r2049493228
##########
format/spec.md:
##########
@@ -648,6 +648,9 @@ Notes:
 5. The `content_offset` and `content_size_in_bytes` fields are used to reference a specific blob for direct access to a deletion vector. For deletion vectors, these values are required and must exactly match the `offset` and `length` stored in the Puffin footer for the deletion vector blob.
 6. The following field ids are reserved on `data_file`: 141.

+For `variant` type, the `lower_bounds` and `upper_bounds` store the lower and upper bounds for the shredded fields within a file with the following considerations: 1) Bounds for array data are not collected; 2) The lower / upper bounds are collected only if all field data share the same shredded type or if the data is missing. These bounds are represented as a Variant object, where each field path serves as a key and the corresponding bound value as the value. The object is then serialized into binary format (see [Variant encoding](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md)).

Review Comment:
```suggestion
For Variant, values in the `lower_bounds` and `upper_bounds` maps store a serialized Variant object that contains lower and upper bounds for fields within the Variant. The object keys are normalized JSON path expressions that uniquely identify a Variant field. The object values are primitive Variant representations of the lower or upper bound for the field. Including bounds for any field is optional and the bounds must have the same Variant type. Bounds for a field must be accurate for all non-null values of the field in a data file. Bounds for values within arrays must be accurate for all values in the array. Bounds must not be written to describe values with mixed Variant types (other than null). For example, a "measurement" field that contains int64 and null values may have bounds, but a string value such as "n/a" or "0" in any record would cause the bounds to be skipped.
The Variant bounds objects are serialized by concatenating the [Variant encoding](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md) of the metadata (containing the normalized field paths) and the bounds object.
```
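
To make the rules in the suggested wording concrete, here is a minimal Python sketch of how a writer might collect per-field bounds. It is an illustration under stated assumptions, not Iceberg's implementation: plain Python dicts, lists, and scalars stand in for Variant values, the `$['field']` path format is a hypothetical placeholder for the normalized JSON path expressions, Python's `type()` is only a rough proxy for the Variant primitive type, and the final Variant binary serialization is not shown. The names `collect_bounds` and `_leaves` are invented for this example.

```python
from typing import Any, Dict, Iterable, Iterator, Set, Tuple


def _leaves(value: Any, path: str) -> Iterator[Tuple[str, Any]]:
    """Yield (path, primitive) pairs for one record.

    Array elements are reported under the array's own path, so a bound
    for that path has to cover every value in the array.
    """
    if isinstance(value, dict):
        for key, child in value.items():
            # Hypothetical path format; the real normalization may differ.
            yield from _leaves(child, f"{path}['{key}']")
    elif isinstance(value, list):
        for child in value:
            yield from _leaves(child, path)
    else:
        yield path, value


def collect_bounds(
    records: Iterable[Any],
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (lower, upper) maps from field path to bound value."""
    lower: Dict[str, Any] = {}
    upper: Dict[str, Any] = {}
    seen_type: Dict[str, type] = {}
    mixed: Set[str] = set()

    for record in records:
        for path, value in _leaves(record, "$"):
            # Nulls never affect bounds; mixed-type paths stay skipped.
            if value is None or path in mixed:
                continue
            kind = type(value)
            if seen_type.setdefault(path, kind) is not kind:
                # Mixed types for this path: drop its bounds entirely.
                mixed.add(path)
                lower.pop(path, None)
                upper.pop(path, None)
                continue
            if path not in lower or value < lower[path]:
                lower[path] = value
            if path not in upper or value > upper[path]:
                upper[path] = value
    return lower, upper


# Example mirroring the "measurement" case from the suggested wording.
example_records = [
    {"measurement": 3, "tags": ["b", "a"]},
    {"measurement": None, "tags": ["c"]},
    {"measurement": 7, "status": "ok"},
    {"status": 0},  # int after str: bounds for "status" are skipped
]
example_lower, example_upper = collect_bounds(example_records)
```

In this sketch a null `measurement` does not prevent bounds from being written, while a single record with a string such as `"n/a"` in that field would remove its bounds. One simplification to note: a path that holds an object in some records and a primitive in others is not detected as mixed here.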