XBaith commented on code in PR #12658:
URL: https://github.com/apache/iceberg/pull/12658#discussion_r2015673209


##########
format/spec.md:
##########
@@ -648,6 +648,9 @@ Notes:
 5. The `content_offset` and `content_size_in_bytes` fields are used to reference a specific blob for direct access to a deletion vector. For deletion vectors, these values are required and must exactly match the `offset` and `length` stored in the Puffin footer for the deletion vector blob.
 6. The following field ids are reserved on `data_file`: 141.
 
+For `variant` type, the `lower_bounds` and `upper_bounds` store the minimum and maximum values for all shredded subcolumns within a file. These bounds are represented as a Variant object, where each subcolumn path serves as a key and the corresponding bound value as the value. The object is then serialized into binary format (see [Variant encoding](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md)).

Review Comment:
   > Lower and upper bound statistics for subcolumns are collected for each data file under the following conditions:
   Uniform value types:
   If all values of a subcolumn match the shredded type, lower/upper bounds are collected.
   Example: for `event.location.longitude`, if all values are of type double, the lower/upper bounds are written to the manifest file.
   Mixed value types:
   If a subcolumn contains multiple types (e.g., double and string), lower/upper bounds are not collected.
   Example: for `event.location.longitude`, if the values include both double and string, lower/upper bounds are excluded.
   Null or missing subcolumn values:
   If some values of a subcolumn are null or missing in a file, but the available values match the shredded type, lower/upper bounds are still collected.
   If all values of a subcolumn are null, lower/upper bounds are not collected. The `null_value_counts` stat could be collected in a later implementation and compared with `value_counts` to detect that all values are null.
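
   The conditions above can be sketched roughly as follows. This is only an illustrative Python sketch, not Iceberg's implementation: `collect_bounds`, `file_bounds`, and the `MISSING` sentinel are hypothetical names, Python types stand in for Variant shredded types, and the result is a plain dict keyed by subcolumn path rather than a serialized Variant object.

   ```python
   from typing import Any, Dict, List, Optional, Tuple

   # Hypothetical sentinel for a subcolumn path that is absent from a row.
   MISSING = object()


   def collect_bounds(values: List[Any], shredded_type: type) -> Optional[Tuple[Any, Any]]:
       """Return (lower, upper) for one subcolumn, or None when bounds are not collected."""
       # Null and missing values are ignored when computing bounds.
       present = [v for v in values if v is not None and v is not MISSING]
       if not present:
           # All values are null or missing: no bounds.
           return None
       if any(type(v) is not shredded_type for v in present):
           # Mixed value types: no bounds.
           return None
       # Uniform value types: bounds are collected.
       return min(present), max(present)


   def file_bounds(
       columns: Dict[str, Tuple[type, List[Any]]]
   ) -> Tuple[Dict[str, Any], Dict[str, Any]]:
       """Build path -> bound mappings, mirroring the object shape described in the spec text."""
       lower: Dict[str, Any] = {}
       upper: Dict[str, Any] = {}
       for path, (shredded_type, values) in columns.items():
           bounds = collect_bounds(values, shredded_type)
           if bounds is not None:
               lower[path], upper[path] = bounds
       return lower, upper


   lower, upper = file_bounds({
       "event.location.longitude": (float, [12.5, None, -71.25, MISSING]),
       "event.location.latitude": (float, [42.0, "north"]),  # mixed types, excluded
       "event.note": (str, [None, None]),                    # all nulls, excluded
   })
   ```

   In this sketch only `event.location.longitude` ends up in `lower` and `upper`; the mixed-type and all-null subcolumns are simply left out of both mappings.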



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

