rdblue commented on code in PR #12658:
URL: https://github.com/apache/iceberg/pull/12658#discussion_r2049493228


##########
format/spec.md:
##########
@@ -648,6 +648,9 @@ Notes:
 5. The `content_offset` and `content_size_in_bytes` fields are used to 
reference a specific blob for direct access to a deletion vector. For deletion 
vectors, these values are required and must exactly match the `offset` and 
`length` stored in the Puffin footer for the deletion vector blob.
 6. The following field ids are reserved on `data_file`: 141.
 
+For `variant` type, the `lower_bounds` and `upper_bounds` store the lower and 
upper bounds for the shredded fields within a file with the following 
considerations: 1) Bounds for array data are not collected; 2) The lower / 
upper bounds are collected only if all field data share the same shredded type 
or if the data is missing. These bounds are represented as a Variant object, 
where each field path serves as a key and the corresponding bound value as the 
value. The object is then serialized into binary format (see [Variant 
encoding](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md)).

Review Comment:
   ```suggestion
   For Variant, values in the `lower_bounds` and `upper_bounds` maps store a 
serialized Variant object that contains lower and upper bounds for fields 
within the Variant. The object keys are normalized JSON path expressions that 
uniquely identify a Variant field. The object values are primitive Variant 
representations of the lower or upper bound for the field. Including bounds for 
any field is optional, and the lower and upper bounds for a field must have the same Variant type.
   
   Bounds for a field must be accurate for all non-null values of the field in 
a data file. Bounds for values within arrays must be accurate for all values in the 
array. Bounds must not be written to describe values with mixed Variant types 
(other than null). For example, a "measurement" field that contains int64 and 
null values may have bounds, but a string value such as "n/a" or "0" in any 
record would cause the bounds to be skipped.
   
   The Variant bounds objects are serialized by concatenating the [Variant 
encoding](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md)
 of the metadata (containing the normalized field paths) and the bounds object.
   ```
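
   To make the mixed-type rule in the suggestion concrete, here is a rough Python sketch of how a writer might decide which fields get bounds. It is only an illustration under stated assumptions, not the Iceberg implementation: `variant_field_bounds` is a hypothetical helper, records are modeled as plain dicts rather than encoded Variants, only top-level primitive fields are handled (arrays and nested objects are simply skipped), the `$['name']` key format is assumed as the normalized path, and field names are not escaped.

   ```python
   def variant_field_bounds(records):
       """Return (lower, upper) dicts keyed by an assumed normalized path.

       Hypothetical helper: nulls are ignored, and a field is dropped
       entirely as soon as two different non-null types are seen for it.
       """
       seen = {}       # path -> (type, lowest value, highest value)
       skipped = set() # paths that must not get bounds

       for record in records:
           for name, value in record.items():
               path = f"$['{name}']"  # assumed path format, no escaping
               if value is None or path in skipped:
                   continue
               if isinstance(value, (list, dict)):
                   # Sketch limitation: no bounds for arrays or nested objects.
                   skipped.add(path)
                   seen.pop(path, None)
                   continue
               if path not in seen:
                   seen[path] = (type(value), value, value)
                   continue
               kind, lo, hi = seen[path]
               if type(value) is not kind:
                   # Mixed types (other than null): skip bounds for this field.
                   skipped.add(path)
                   del seen[path]
                   continue
               seen[path] = (kind, min(lo, value), max(hi, value))

       lower = {path: lo for path, (_, lo, _) in seen.items()}
       upper = {path: hi for path, (_, _, hi) in seen.items()}
       return lower, upper
   ```

   With this sketch, a "measurement" field holding only integers and nulls keeps its bounds, while a single "n/a" string in any record removes "measurement" from both maps. Serializing the two resulting objects with the Variant encoding is out of scope here.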



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

