Fokko opened a new issue, #34162:
URL: https://github.com/apache/arrow/issues/34162
### Describe the bug, including details regarding any error messages, version, and platform.
I was working on some test cases for the PyIceberg integration and hit this
edge case. When a file contains only NaN values, it is skipped when reading
with an `is_null(nan_is_null=True)` filter.
In Spark, I create the following table:
```sql
CREATE TABLE test_null_nan
USING iceberg
AS SELECT
1 AS idx,
float('NaN') AS col_numeric
UNION ALL SELECT
2 AS idx,
null AS col_numeric
UNION ALL SELECT
3 AS idx,
1 AS col_numeric
```
This creates three files, each containing one record:
```
➜ python git:(fd-integration-tests) ✗ pyiceberg --catalog local files
default.test_null_nan
Snapshots: local.default.test_null_nan
└── Snapshot 870844541941792785, schema 0:
s3a://warehouse/wh/default/test_null_nan/metadata/snap-870844541941792785-1-a05e1621-f735-4837-bb86-ce9886da3e6b.avro
└── Manifest:
s3a://warehouse/wh/default/test_null_nan/metadata/a05e1621-f735-4837-bb86-ce9886da3e6b-m0.avro
├── Datafile:
s3a://warehouse/wh/default/test_null_nan/data/00000-0-658408d0-d063-4caa-b310-f68552713bea-00001.parquet
├── Datafile:
s3a://warehouse/wh/default/test_null_nan/data/00001-1-5e625fcb-4a0c-4082-9371-7f4897768ccd-00001.parquet
└── Datafile:
s3a://warehouse/wh/default/test_null_nan/data/00002-2-11de56ee-27c1-45ff-be61-cc52727c1b84-00001.parquet
```
If I filter using `pc.col('col_numeric').is_null(nan_is_null=True) &
~pc.col('col_numeric').is_null()`, I don't get any results, even though the
NaN row (`idx = 1`) should match. When I rewrite the table into a single file:
```sql
CREATE TABLE test_null_nan_rewritten
USING iceberg
AS SELECT * FROM test_null_nan
```
and then apply the same filter, I do get results. I suspect something is off
with the page skipping when `nan_is_null=True`.
### Component(s)
Python