zhongyujiang commented on issue #6516: URL: https://github.com/apache/iceberg/issues/6516#issuecomment-1370862183
I ran into the same problem.

> I'm not sure why having a NaN would mean that the statistics should report there are no non-null values

I had the same doubt. After digging a bit more, I think the doc here is trying to say "whether any non-null values have been added to this statistics object", not "whether there is a non-null value in the column chunk". If I understand correctly, statistics become unreliable when NaN values are present, so Parquet just discards the statistics that have been collected so far and sets hasNonNullValue to false.

So I think the root cause is that ParquetMetricsRowGroupFilter mistakenly uses `hasNonNullValue()` to determine whether there is a non-null value in the column chunk. As explained above, this method cannot be used for that purpose. I think we can only conclude that there is no non-null value in the column chunk when `Statistics#getNumNulls() == ColumnChunkMetaData#getValueCount()`.
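To make the difference concrete, here is a minimal sketch in plain Java. It does not use the Parquet or Iceberg classes: `ChunkStats` is a hypothetical stand-in that only carries the three numbers discussed above (the values one would read from `Statistics#getNumNulls()`, `ColumnChunkMetaData#getValueCount()`, and the hasNonNullValue flag), and both method names are made up for illustration. It also assumes the null count was actually recorded for the chunk.

```java
final class AllNullCheck {

    // Hypothetical stand-in for the per-chunk numbers; not a Parquet class.
    static final class ChunkStats {
        final long numNulls;           // nulls counted in the column chunk
        final long valueCount;         // total values (nulls included) in the chunk
        final boolean hasNonNullValue; // flag as reported by the statistics

        ChunkStats(long numNulls, long valueCount, boolean hasNonNullValue) {
            this.numNulls = numNulls;
            this.valueCount = valueCount;
            this.hasNonNullValue = hasNonNullValue;
        }
    }

    // The check described as mistaken: treats a false flag as "chunk is all nulls".
    static boolean allNullByFlag(ChunkStats s) {
        return !s.hasNonNullValue;
    }

    // The check proposed in the comment: all null only if every value is a null.
    static boolean allNullByCounts(ChunkStats s) {
        return s.numNulls == s.valueCount;
    }

    public static void main(String[] args) {
        // Three non-null values, one of them NaN: min/max were discarded,
        // so the flag is false even though the chunk holds real values.
        ChunkStats withNaN = new ChunkStats(0, 3, false);
        // Three values, all null.
        ChunkStats onlyNulls = new ChunkStats(3, 3, false);

        System.out.println("NaN chunk, flag check:    " + allNullByFlag(withNaN));
        System.out.println("NaN chunk, count check:   " + allNullByCounts(withNaN));
        System.out.println("null chunk, flag check:   " + allNullByFlag(onlyNulls));
        System.out.println("null chunk, count check:  " + allNullByCounts(onlyNulls));
    }
}
```

Under these assumptions the flag-based check would wrongly classify the NaN chunk as all null (and a filter could skip the row group), while the count-based check only does so for the chunk that really contains nothing but nulls.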