Re: [I] Implement nan_value_counts && distinct_counts metrics in parquet writer [iceberg-rust]

via GitHub Sun, 01 Dec 2024 03:25:14 -0800


feniljain commented on issue #417:
URL: https://github.com/apache/iceberg-rust/issues/417#issuecomment-2509721763


   Hey @Fokko ! 👋🏻 
   
   As the original author has not replied, I am interested in taking it up :)
   
   A few points, regardless of who this gets assigned to:
   
   - I couldn't find `distinct_counts` in the Java or Python documentation. Am I 
misreading them? If they are present, could someone point me to them, 
please? Also, distinct counts are available at the `ColumnChunk` level, but they 
cannot be aggregated at the `DataFile` level, because the same values can appear 
in two different `ColumnChunk`s. Am I understanding this correctly?
   - For `NaN` value counts, as the 
[javadoc](https://iceberg.apache.org/javadoc/1.4.1/org/apache/iceberg/DoubleFieldMetrics.html)
 mentions:
   ```
   Parquet/ORC keeps track of most metrics in file statistics, and only NaN 
counter is actually tracked by writers. This wrapper ensures that metrics not 
being updated by those writers will not be incorrectly used, by throwing 
exceptions when they are accessed.
   ```
   So we will have to track it ourselves. I think we would go through each 
`Field` in each column of the `RecordBatch` supplied 
[here](https://github.com/apache/iceberg-rust/blob/f3a571d355041e50608c87d44fb042bdf7dfca1e/crates/iceberg/src/writer/file_writer/parquet_writer.rs#L397),
 pick out the float-typed ones, and count the `NaN`s in them. Is this understanding correct?
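
To make the two points above concrete, here is a rough sketch in plain Rust. It does not use the actual writer or Arrow types: `FloatColumn`, `accumulate_nan_counts`, and `distinct_count` are hypothetical names made up for illustration, and a column is modelled as a field id plus a `Vec<Option<f64>>`. The idea is only that NaN counts can be summed batch by batch, while per-chunk distinct counts cannot.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for one float column of a batch: a field id plus
/// nullable values. A real writer would read Arrow arrays instead.
pub struct FloatColumn {
    pub field_id: i32,
    pub values: Vec<Option<f64>>,
}

/// Adds the NaN count of every column in `batch` to the running per-field
/// totals. Nulls are skipped; only non-null NaN values are counted.
pub fn accumulate_nan_counts(totals: &mut HashMap<i32, u64>, batch: &[FloatColumn]) {
    for col in batch {
        let nans = col.values.iter().flatten().filter(|x| x.is_nan()).count() as u64;
        *totals.entry(col.field_id).or_insert(0) += nans;
    }
}

/// Number of distinct non-null values in one chunk. Floats are compared by
/// bit pattern, so NaNs with different payloads count as different values.
pub fn distinct_count(values: &[Option<f64>]) -> usize {
    values
        .iter()
        .flatten()
        .map(|x| x.to_bits())
        .collect::<HashSet<u64>>()
        .len()
}
```

Calling `accumulate_nan_counts` once per incoming batch keeps a running total per field id, which is additive across batches. By contrast, a chunk holding `1.0, 2.0` and another holding `2.0, 3.0` each report 2 distinct values, but the combined data has 3, not 4, so the per-chunk numbers alone are not enough to derive a file-level distinct count.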


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


