zhongyujiang commented on issue #6516:
URL: https://github.com/apache/iceberg/issues/6516#issuecomment-1370862183

   I ran into the same problem.
   > I'm not sure why having a NaN would mean that the statistics should report 
there are no non-null values
   
   I had the same doubt. After digging a bit more, I think the doc here is 
trying to say "whether any non-null values have been added to this 
statistics object", not "whether there is a non-null value in the column chunk". If I 
understand correctly, statistics become unreliable when NaN values are 
present, so Parquet just discards the statistics that have been added and 
sets hasNonNullValue to false.
   
   So I think the root cause is that ParquetMetricsRowGroupFilter 
mistakenly uses `hasNonNullValue()` to determine whether there is a non-null value 
in the column chunk. As explained above, this method cannot be used for that purpose. I 
think we can only conclude that the column chunk has no non-null values 
when 
   `Statistics#getNumNulls() == ColumnChunkMetaData#getValueCount()`.
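
   A minimal sketch of the check I have in mind, not the actual Iceberg code. 
`NullCountCheck` and `allValuesAreNull` are hypothetical names; the two arguments 
stand in for `Statistics#getNumNulls()` and the column chunk's value count. 
Treating a negative null count as "not set" is an assumption of this sketch.

```java
// Hypothetical helper: decides whether a column chunk contains only nulls
// by comparing the null count with the total value count.
final class NullCountCheck {
  private NullCountCheck() {}

  /**
   * Returns true only when the null count is known and accounts for every value.
   * A negative numNulls is treated as "not set" (assumption of this sketch),
   * in which case nothing can be concluded and the chunk must be kept.
   */
  static boolean allValuesAreNull(long numNulls, long valueCount) {
    return numNulls >= 0 && numNulls == valueCount;
  }

  public static void main(String[] args) {
    // Chunk with NaN values: min/max were discarded, but only 10 of 100 values are null.
    System.out.println(allValuesAreNull(10, 100));   // false
    // Chunk where every value is null.
    System.out.println(allValuesAreNull(100, 100));  // true
  }
}
```

   With a check like this, a chunk whose min/max were dropped because of NaN 
is still read, since its null count does not cover all values.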


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

