manirajv06 commented on issue #13855: URL: https://github.com/apache/iceberg/issues/13855#issuecomment-3261174709
`Schema` already has `highestFieldId`. Using the schema linked to the given file, it is as simple as checking `highestFieldId` to decide whether the column existed in the schema or not. But that only tells us whether the column existed in the schema. It does not mean the column was actually written if it is optional, and I don't think writers write null values in this case (based on my understanding, which can also be confirmed from @RussellSpitzer's earlier response).

I assume you are referring to "metrics" when you mention "stats". If you look at the discussion [here](https://github.com/apache/iceberg/pull/13398#discussion_r2170905401), one of the primary reasons not to depend on metrics is that they DO NOT cover ALL columns and are driven by the `write.metadata.metrics.max-inferred-column-defaults` configuration. If a schema has 20 columns and `max inferred column` says 10 columns, metrics would be generated for only 10 columns. Metrics like `valueCounts`, `nullValueCounts`, `nanValueCounts` and so on would be generated only for certain columns in the schema, not for all. So the approach of depending on metrics to decide whether a column has been written has been completely ruled out.

When we discussed the next option to solve this, we came up with the `linking schema id with file` approach. But even that did not completely solve our problem, for the reasons discussed above (optional fields with null values). Then we came up with a "columns written" metric, similar to `rowCount`, that is not bounded by the above config limit and therefore covers all the columns written in the file. "columns written" could contain the field IDs (Integer) of all columns written in the file.

I hope this gives you more info and fixes the gap in our understanding. Please share your thoughts.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
