manirajv06 commented on issue #13855:
URL: https://github.com/apache/iceberg/issues/13855#issuecomment-3261174709

   `Schema` already has `highestFieldId`. Using the schema linked to the 
given file, it is as simple as comparing against `highestFieldId` to decide 
whether the column existed in the schema or not. But that does not mean the 
column has actually been written: if it is optional, I don't think writers 
write null values in this case (based on my understanding, and this can also 
be confirmed from @RussellSpitzer's earlier response).
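
   As a rough illustration of the schema-based check described above, here is 
a minimal Python sketch. `FileSchemaInfo` and `existed_in_schema` are 
hypothetical names, not Iceberg APIs, and the sketch assumes field IDs are 
assigned in increasing order and never reused.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileSchemaInfo:
    """Hypothetical stand-in for the schema linked to a data file."""
    highest_field_id: int


def existed_in_schema(field_id: int, schema: FileSchemaInfo) -> bool:
    # Assumption: IDs only grow, so an ID above the highest one known to the
    # file's schema belongs to a column added after the file was written.
    return field_id <= schema.highest_field_id


schema_at_write = FileSchemaInfo(highest_field_id=5)
print(existed_in_schema(3, schema_at_write))  # column was part of the schema
print(existed_in_schema(7, schema_at_write))  # column was added later
```

   Note that a `True` result only says the column was part of the schema; it 
says nothing about whether an optional column was actually written to the file.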
   
   I assume you are referring to "metrics" when you mention "stats". 
   
   If you look at the discussion 
[here](https://github.com/apache/iceberg/pull/13398#discussion_r2170905401), 
that was one of the primary reasons not to depend on metrics: they DO NOT 
cover ALL columns and are driven by the 
`write.metadata.metrics.max-inferred-column-defaults` configuration. If a 
schema has 20 columns and the max inferred column limit is 10, metrics would 
be generated for only 10 columns. Metrics like `valueCounts`, 
`nullValueCounts`, `nanValueCounts` and so on would be generated only for 
certain columns in the schema, not for all. So the approach of depending on 
metrics to decide whether a column has been written has been completely 
ruled out. When we discussed the next option, we came up with the 
`linking schema id with file` approach. But even that did not completely 
solve our problem, for the reasons discussed above (optional fields with null 
values). Then we came up with a "columns written" metric, similar to 
`rowCount` and not bounded by the above config limit, to record all the 
columns written in the file. "columns written" could contain the field IDs 
(Integer) of all columns written in the file.
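
   To make the difference concrete, here is a small Python sketch of the 
20-column example. It is only a model of the idea under discussion: 
`write_file_metadata`, `value_counts` and `columns_written` are hypothetical 
names, and picking the first N columns is a simplifying assumption about how 
the limit is applied.

```python
def write_file_metadata(columns, max_inferred_columns):
    """Model the metadata recorded for one data file.

    columns: list of (field_id, values) pairs actually written to the file.
    Returns (value_counts, columns_written).
    """
    # Per-column metrics are bounded by the configured limit.
    value_counts = {}
    for field_id, values in columns[:max_inferred_columns]:
        value_counts[field_id] = len(values)

    # The proposed "columns written" metric is not bounded by that limit.
    columns_written = {field_id for field_id, _ in columns}
    return value_counts, columns_written


# 20 columns written, metrics limit of 10.
written = [(field_id, [1, 2, 3]) for field_id in range(1, 21)]
value_counts, columns_written = write_file_metadata(written, 10)

# Field 15 was written, field 21 was not. Both are missing from the metrics,
# so metrics alone cannot tell the two cases apart.
print(15 in value_counts, 21 in value_counts)
print(15 in columns_written, 21 in columns_written)
```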
   
   I hope this gives you more info and fixes the gap in our understanding. 
Please share your thoughts.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

