WZhuo opened a new pull request, #727:
URL: https://github.com/apache/iceberg-cpp/pull/727

   ## What
   
   Collects NaN value counts for float and double columns during Parquet 
writes, since the Parquet footer statistics do not track NaN counts.
   
   ## Changes
   
   - **Write-side NaN metric collection** (`FieldMetricsCollector`): A visitor 
that walks each record batch before writing, accumulating value counts, null 
counts, NaN counts, and NaN-excluding lower/upper bounds for float/double 
fields.
   - **MetricsConfig-aware skipping**: Fields whose `MetricsMode` is `kNone` 
are skipped entirely, avoiding wasted work.
   - **Integration with existing footer metrics**: Write-side `FieldMetrics` 
take precedence over footer statistics in `ParquetMetrics::GetMetrics`, so NaN 
counts are populated while counts/bounds still fall back to footer stats when 
write-side data isn't available.
   - **Tests**: `ParquetMetricsTest` now overrides `ReportsNanCounts()` to 
return `true`, so the existing NaN test cases verify NaN counts in addition to 
their value/null count assertions.
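
The precedence rule in the third bullet can be sketched as a per-metric merge. This is only an illustration under stated assumptions: `SimpleFieldMetrics`, `Prefer`, and `MergeMetrics` are hypothetical names, bounds are simplified to `double`, and the sketch assumes precedence is resolved metric by metric. The actual `FieldMetrics` type and the logic in `ParquetMetrics::GetMetrics` may differ.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical, simplified metrics record (not the iceberg-cpp FieldMetrics type).
// An unset optional means "this source has no data for that metric".
struct SimpleFieldMetrics {
  std::optional<std::int64_t> value_count;
  std::optional<std::int64_t> null_count;
  std::optional<std::int64_t> nan_count;
  std::optional<double> lower_bound;
  std::optional<double> upper_bound;
};

// Prefer the write-side value when present, otherwise fall back to the footer.
template <typename T>
std::optional<T> Prefer(const std::optional<T>& write_side,
                        const std::optional<T>& footer) {
  return write_side.has_value() ? write_side : footer;
}

// Assumption: precedence is applied independently to each metric.
SimpleFieldMetrics MergeMetrics(const SimpleFieldMetrics& write_side,
                                const SimpleFieldMetrics& footer) {
  SimpleFieldMetrics merged;
  merged.value_count = Prefer(write_side.value_count, footer.value_count);
  merged.null_count = Prefer(write_side.null_count, footer.null_count);
  merged.nan_count = Prefer(write_side.nan_count, footer.nan_count);
  merged.lower_bound = Prefer(write_side.lower_bound, footer.lower_bound);
  merged.upper_bound = Prefer(write_side.upper_bound, footer.upper_bound);
  return merged;
}
```

With this shape, a field whose write-side record carries only a NaN count still ends up with footer-derived counts and bounds, which matches the fallback described above.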
   
   ## Behavior alignment with Java
   
   - Fields nested inside lists/maps do not get NaN metrics in either 
implementation: Java collects them and then discards them, while C++ skips 
collection entirely.
   - NaN values are excluded from lower/upper bounds in both implementations.
   - Float/double fields with all-NaN values correctly set `nan_value_count` 
without setting bounds.
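
A minimal sketch of the NaN handling described in the last two points, for a single nullable double column. `FloatFieldStats` and `CollectStats` are hypothetical names used for illustration only; the PR's `FieldMetricsCollector` walks record batches as a visitor and is not reproduced here. The sketch assumes the value count includes nulls and NaNs.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical per-field accumulator (not the PR's FieldMetricsCollector).
struct FloatFieldStats {
  std::int64_t value_count = 0;  // assumed to include nulls and NaNs
  std::int64_t null_count = 0;
  std::int64_t nan_count = 0;
  std::optional<double> lower_bound;  // stays unset until a non-NaN value is seen
  std::optional<double> upper_bound;

  void Update(const std::optional<double>& v) {
    ++value_count;
    if (!v.has_value()) {
      ++null_count;
    } else if (std::isnan(*v)) {
      ++nan_count;  // counted, but never allowed to influence the bounds
    } else {
      if (!lower_bound || *v < *lower_bound) lower_bound = *v;
      if (!upper_bound || *v > *upper_bound) upper_bound = *v;
    }
    // Invariant: nulls and NaNs are subsets of all values seen so far.
    assert(null_count + nan_count <= value_count);
  }
};

// Accumulates statistics over one column; std::nullopt models a null cell.
FloatFieldStats CollectStats(const std::vector<std::optional<double>>& column) {
  FloatFieldStats stats;
  for (const auto& v : column) stats.Update(v);
  return stats;
}
```

For an all-NaN column, `nan_count` equals `value_count` and both bounds remain unset, which is the behavior the last point describes.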


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

