RussellSpitzer opened a new issue, #6516:
URL: https://github.com/apache/iceberg/issues/6516

   ### Apache Iceberg version
   
   1.1.0 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   Summary:
   This effects files where the parquet statistics mark either the min or max 
of a given column to NaN with Float or Double type (I have not checked other 
types) which will have their row groups skipped incorrectly. 
   
   In this case our evaluators are handling undefined min/max statistics 
incorrectly. When one of these two stats is **NaN** parquet will set hasNonNull 
equal to false.
   
   
https://github.com/apache/parquet-mr/blob/433de8df33fcf31927f7b51456be9f53e64d48b9/parquet-column/src/main/java/org/apache/parquet/column/statistics/Statistics.java#L89-L92
   
   ```java
           // Drop min/max values in case of NaN as the sorting order of values 
is undefined for this case
           if (min.isNaN() || max.isNaN()) {
             stats.setMinMax(0.0f, 0.0f);
             ((Statistics<?>) stats).hasNonNullValue = false;
   ```
   
   Unfortunately when evaluating `gt` we check whether this is "false" and 
return "rows won't match" if it is. 
   
   See
   
https://github.com/apache/iceberg/blob/cf00f6a06b256e9c4defe226b6a37aa83c40f561/parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java#L279-L281
   
   I'm unsure what the intent in the Parquet code is here and it seems to be a 
bug to me in the parquet library. I'm not sure why having a NaN would mean that 
the statistics should report there are no non-null values. According to the 
Java Doc in the parquet library 
   
   ```java
     /**
      * Returns whether there have been non-null values added to this statistics
      *
      * @return true if the values contained at least one non-null value
      */
   ```
   
   Which implies to me that this should not be false if we added a NaN.
   
   While we probably should try to fix this on the Parquet Side I think we can 
fix this on the iceberg side by not using the "hasNonNull" in our metric checks 
if the value count is > 0? 
   
   Thanks John Pugliesi for reporting this and providing a repo 🙇 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to