[GitHub] [arrow] Fokko opened a new issue, #34162: [Python] `is_null(nan_is_null=True)` does not work with only NaN's

via GitHub Mon, 13 Feb 2023 07:29:01 -0800


Fokko opened a new issue, #34162:
URL: https://github.com/apache/arrow/issues/34162


   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I was working on some test cases for the PyIceberg integration and hit this edge case: a file that contains only NaN values is skipped when it is read with an `is_null(nan_is_null=True)` filter.
   
   In Spark, I create the following table:
   
   ```sql
   CREATE TABLE test_null_nan
   USING iceberg
   AS SELECT
       1            AS idx,
       float('NaN') AS col_numeric
   UNION ALL SELECT
       2            AS idx,
       null         AS col_numeric
   UNION ALL SELECT
       3            AS idx,
       1            AS col_numeric
   ```
   
   This creates three files, each containing one record:
   
   ```
   ➜  python git:(fd-integration-tests) ✗ pyiceberg --catalog local files default.test_null_nan
   Snapshots: local.default.test_null_nan
   └── Snapshot 870844541941792785, schema 0: s3a://warehouse/wh/default/test_null_nan/metadata/snap-870844541941792785-1-a05e1621-f735-4837-bb86-ce9886da3e6b.avro
       └── Manifest: s3a://warehouse/wh/default/test_null_nan/metadata/a05e1621-f735-4837-bb86-ce9886da3e6b-m0.avro
           ├── Datafile: s3a://warehouse/wh/default/test_null_nan/data/00000-0-658408d0-d063-4caa-b310-f68552713bea-00001.parquet
           ├── Datafile: s3a://warehouse/wh/default/test_null_nan/data/00001-1-5e625fcb-4a0c-4082-9371-7f4897768ccd-00001.parquet
           └── Datafile: s3a://warehouse/wh/default/test_null_nan/data/00002-2-11de56ee-27c1-45ff-be61-cc52727c1b84-00001.parquet
   ```
   
   If I filter using `pc.col('col_numeric').is_null(nan_is_null=True) & ~pc.col('col_numeric').is_null()`, I don't get any results. When I rewrite the table into a single file:
   
   ```sql
   CREATE TABLE test_null_nan_rewritten
   USING iceberg
   AS SELECT * FROM test_null_nan
   ```
   
   And then apply the same filter, I do get results. I suspect something is off with the page skipping when `nan_is_null=True`.
   
   ### Component(s)
   
   Python

