feniljain commented on issue #417: URL: https://github.com/apache/iceberg-rust/issues/417#issuecomment-2509721763
Hey @Fokko! 👋🏻 As the original author has not replied, I am interested in taking this up :)

A few points, regardless of who this gets assigned to:

- I couldn't find `distinct_counts` in the Java or Python documentation. Am I reading them wrong somewhere? If they are present, can someone please point me to them? Also, distinct counts are present at the `ColumnChunk` level, but it would not be possible to aggregate them at the `DataFile` level, because the same values can occur in two different `ColumnChunk`s. Am I understanding this correctly?
- For `NaN` value counts, as the [javadoc](https://iceberg.apache.org/javadoc/1.4.1/org/apache/iceberg/DoubleFieldMetrics.html) mentions:

  ```
  Parquet/ORC keeps track of most metrics in file statistics, and only NaN counter is actually tracked by writers. This wrapper ensures that metrics not being updated by those writers will not be incorrectly used, by throwing exceptions when they are accessed.
  ```

  We will have to keep track of it on our own, so I think we would go through each `Field` in each column of the `RecordBatch` supplied [here](https://github.com/apache/iceberg-rust/blob/f3a571d355041e50608c87d44fb042bdf7dfca1e/crates/iceberg/src/writer/file_writer/parquet_writer.rs#L397), find the float values, and count the `NaN`s in them. Is this understanding correct?
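To make the two points above concrete, here is a minimal sketch using only the Rust standard library. It is not the iceberg-rust implementation: `count_nans`, `NanCounter`, and `distinct_count` are hypothetical names, a column is modelled as a plain `&[Option<f64>]` slice (with `None` standing in for a null) rather than an Arrow float array from a `RecordBatch`, and the per-field-id map is an assumption about how the counts might be stored.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical helper: count the NaN values in one nullable float column.
/// `None` models a null slot and is not counted as NaN.
fn count_nans(values: &[Option<f64>]) -> u64 {
    values
        .iter()
        .filter(|v| matches!(v, Some(x) if x.is_nan()))
        .count() as u64
}

/// Hypothetical accumulator, keyed by field id, that a writer could
/// update once per batch it writes.
#[derive(Default)]
struct NanCounter {
    counts: HashMap<i32, u64>,
}

impl NanCounter {
    /// Add the NaN count of one batch's column to the running total.
    fn update(&mut self, field_id: i32, values: &[Option<f64>]) {
        *self.counts.entry(field_id).or_insert(0) += count_nans(values);
    }
}

/// Number of distinct values in one chunk of an integer column.
fn distinct_count(values: &[i64]) -> usize {
    values.iter().collect::<HashSet<_>>().len()
}
```

NaN counts are additive, so calling `update` for every batch yields a correct per-file total. Distinct counts are not: chunks `[1, 2, 3]` and `[3, 4]` have 3 and 2 distinct values, but their union has 4, not 5, so summing per-chunk counts would overstate the file-level figure whenever chunks share values.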