emkornfield commented on issue #13855: URL: https://github.com/apache/iceberg/issues/13855#issuecomment-3246179320
> Currently metrics are not good for this because we have almost no way to determine the difference between, didn't store metrics for a column and column wasn't written. In general we assume that if we don't see metrics, that column exists.

I agree with this.

> Linking Schema has a similar issue, If I don't have metrics for an optional column it could be missing or It could have values, so I can't make the call.

IIUC, the proposal is not meant to address this. There are two cases:

1. The column did not exist at the time of writing (and therefore we couldn't possibly have written statistics for it in that file), so we need some additional metadata to record this fact (e.g. schema ID).
2. Some writers are not writing out optional columns even if they are in the schema, or we want to identify files where every value is null for a column that existed in the schema.

(1) seems reasonable. (2) seems like it can already be covered (see below).

> I just worry that the more common case is schemas with optional columns (possibly many many optional columns) where we aren't storing metrics.

This is an implementation issue though? It seems it can be mostly solved by adapting metadata writers to write out null-counts when null-count = value count? (It might be a little awkward in V4.) But we have a [current PR reiterating that all columns must be written even if values are null](https://github.com/apache/iceberg/pull/13936). Or is the argument here solely that the cost of storing the statistics in the current data structures is too expensive?
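To make the two cases concrete, here is a rough Python sketch of how a reader might classify a column for one data file. It is only an illustration under stated assumptions, not Iceberg's API: `FileMetrics`, `ColumnState`, `column_state`, and `write_schema_field_ids` are hypothetical names. `write_schema_field_ids` stands in for whatever extra metadata (e.g. a schema ID resolved to its field IDs) would cover case 1, and the two count maps mirror the per-field value and null-value counts a manifest entry can carry.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional, Set


class ColumnState(Enum):
    NOT_IN_WRITE_SCHEMA = "not_in_write_schema"  # case 1
    ALL_NULL = "all_null"                        # case 2
    HAS_VALUES = "has_values"
    UNKNOWN = "unknown"                          # no metrics, cannot decide


@dataclass
class FileMetrics:
    # Hypothetical: field IDs of the schema the file was written with,
    # or None if that information was not recorded.
    write_schema_field_ids: Optional[Set[int]] = None
    # Per-field counts keyed by field ID; a missing key means "not stored".
    value_counts: Dict[int, int] = field(default_factory=dict)
    null_value_counts: Dict[int, int] = field(default_factory=dict)


def column_state(metrics: FileMetrics, field_id: int) -> ColumnState:
    # Case 1: the column was not part of the schema at write time.
    ids = metrics.write_schema_field_ids
    if ids is not None and field_id not in ids:
        return ColumnState.NOT_IN_WRITE_SCHEMA

    values = metrics.value_counts.get(field_id)
    nulls = metrics.null_value_counts.get(field_id)
    # Without both counts we cannot tell "missing" from "has values".
    if values is None or nulls is None:
        return ColumnState.UNKNOWN

    # Case 2: the writer recorded null-count == value count.
    # An empty file (both counts 0) is also reported as ALL_NULL here.
    return ColumnState.ALL_NULL if nulls == values else ColumnState.HAS_VALUES
```

With this shape, a file that stores no counts for an optional column still lands in `UNKNOWN`, which is the situation the quoted concern describes; the suggestion above amounts to having writers always emit the counts needed to reach `ALL_NULL`.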
