emkornfield commented on issue #13855:
URL: https://github.com/apache/iceberg/issues/13855#issuecomment-3275784592

   > I assume you are referring "metrics" when you mention "stats".
   
   Yes, the spec doesn't have a name for them in aggregate, but I was 
referring to `value_counts` and `null_value_counts` from the spec.
   
   > If you look at the discussions 
https://github.com/apache/iceberg/pull/13398#discussion_r2170905401, which was 
one of the primary reasons not to depend on metrics as it DOES NOT cover for 
ALL columns and driven by write.metadata.metrics.max-inferred-column-defaults 
configuration. If a schema has 20 columns and max inferred column says 10 
columns, metrics would be generated only for 10 columns. Metrics like 
valueCounts, nullValueCounts, nanValueCounts and so on would be generated only 
for certain columns in the schema, not for all.
   
   I don't think this should be ruled out, or at least there seems to be a path 
here without changing the specification. As far as I can tell, this 
configuration is not part of the spec. I think the disconnect is that one 
could add a new configuration, named something like 
`write.metadata.metrics.always_record_all_null_metrics`, which would always 
write the metrics that implementations need to infer whether a column is 
entirely null once it exists in the schema.
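   
   As a rough sketch of the inference I have in mind (not actual Iceberg 
code; the function name and the plain-dict representation of the per-file 
maps are assumptions for illustration), a reader could compare the two counts 
for a field ID and fall back to "unknown" whenever either metric is missing:

```python
from typing import Mapping, Optional


def is_column_all_null(
    field_id: int,
    value_counts: Optional[Mapping[int, int]],
    null_value_counts: Optional[Mapping[int, int]],
) -> Optional[bool]:
    """Hypothetical helper: decide from data-file metrics whether a column is all null.

    Returns True or False when both metrics are present for the field,
    and None when the metrics cannot answer the question.
    """
    # Either map may be absent from the data file entry altogether.
    if value_counts is None or null_value_counts is None:
        return None

    total = value_counts.get(field_id)
    nulls = null_value_counts.get(field_id)

    # Metrics may have been skipped for this column, e.g. due to a column limit.
    if total is None or nulls is None:
        return None

    # value_counts includes nulls, so the column is all null exactly when the
    # two counts match (vacuously true for a file with zero rows).
    return total == nulls
```

   With a configuration like the one proposed above, the `None` case would not 
arise for columns that exist in the schema at write time, so readers could 
rely on the comparison alone.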
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

