RussellSpitzer opened a new issue, #6516: URL: https://github.com/apache/iceberg/issues/6516
### Apache Iceberg version 1.1.0 (latest release) ### Query engine Spark ### Please describe the bug 🐞 Summary: This effects files where the parquet statistics mark either the min or max of a given column to NaN with Float or Double type (I have not checked other types) which will have their row groups skipped incorrectly. In this case our evaluators are handling undefined min/max statistics incorrectly. When one of these two stats is **NaN** parquet will set hasNonNull equal to false. https://github.com/apache/parquet-mr/blob/433de8df33fcf31927f7b51456be9f53e64d48b9/parquet-column/src/main/java/org/apache/parquet/column/statistics/Statistics.java#L89-L92 ```java // Drop min/max values in case of NaN as the sorting order of values is undefined for this case if (min.isNaN() || max.isNaN()) { stats.setMinMax(0.0f, 0.0f); ((Statistics<?>) stats).hasNonNullValue = false; ``` Unfortunately when evaluating `gt` we check whether this is "false" and return "rows won't match" if it is. See https://github.com/apache/iceberg/blob/cf00f6a06b256e9c4defe226b6a37aa83c40f561/parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java#L279-L281 I'm unsure what the intent in the Parquet code is here and it seems to be a bug to me in the parquet library. I'm not sure why having a NaN would mean that the statistics should report there are no non-null values. According to the Java Doc in the parquet library ```java /** * Returns whether there have been non-null values added to this statistics * * @return true if the values contained at least one non-null value */ ``` Which implies to me that this should not be false if we added a NaN. While we probably should try to fix this on the Parquet Side I think we can fix this on the iceberg side by not using the "hasNonNull" in our metric checks if the value count is > 0? Thanks John Pugliesi for reporting this and providing a repo 🙇 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org