syun64 commented on PR #955: URL: https://github.com/apache/iceberg-python/pull/955#issuecomment-2243845630
The proposed implementation is consistent with Spark Iceberg's behavior. For a given Iceberg table:

```
spark.read.table("demo.tacocat.test_null").show()
>> +----+----+
>> |  id|data|
>> +----+----+
>> |   1|   a|
>> |   2|   b|
>> |   3|   c|
>> |NULL|   d|
>> |   5|   e|
>> +----+----+
```

The Scan API ignores null values unless null is referenced explicitly in the predicate expression:

```
spark.sql("""SELECT * FROM demo.tacocat.test_null WHERE id > 2""").show()
>> +---+----+
>> | id|data|
>> +---+----+
>> |  3|   c|
>> |  5|   e|
>> +---+----+

spark.sql("""SELECT * FROM demo.tacocat.test_null WHERE not id > 2""").show()
>> +---+----+
>> | id|data|
>> +---+----+
>> |  1|   a|
>> |  2|   b|
>> +---+----+

spark.sql("""SELECT * FROM demo.tacocat.test_null WHERE id > 2 OR id is NULL""").show()
>> +----+----+
>> |  id|data|
>> +----+----+
>> |   3|   c|
>> |NULL|   d|
>> |   5|   e|
>> +----+----+
```

Likewise, the DELETE API avoids deleting null rows unless null is referenced directly in the predicate expression:

```
spark.sql("""DELETE FROM demo.tacocat.test_null WHERE id == 2""")
spark.read.table("demo.tacocat.test_null").show()
>> +----+----+
>> |  id|data|
>> +----+----+
>> |   1|   a|
>> |   3|   c|
>> |NULL|   d|
>> |   5|   e|
>> +----+----+

spark.sql("""DELETE FROM demo.tacocat.test_null WHERE id <= 2""")
spark.read.table("demo.tacocat.test_null").show()
>> +----+----+
>> |  id|data|
>> +----+----+
>> |   3|   c|
>> |NULL|   d|
>> |   5|   e|
>> +----+----+

spark.sql("""DELETE FROM demo.tacocat.test_null WHERE id <= 2 or id IS NULL""")
spark.read.table("demo.tacocat.test_null").show()
>> +---+----+
>> | id|data|
>> +---+----+
>> |  3|   c|
>> |  5|   e|
>> +---+----+
```

So I agree with @jqin61's finding: we have to walk the predicate expression to check whether nulls/NaNs are mentioned directly, so that the expression can be inverted correctly on delete, as proposed. Simple negation of a `pyarrow.compute.Expression` will unfortunately yield the wrong outcome:
```
import pyarrow as pa
import pyarrow.compute as pc

expr = (pc.field("a") == pc.scalar(3))
tbl = pa.Table.from_pydict({"a": [1, 2, 3, None]})
tbl.filter(~expr)
>> pyarrow.Table
>> a: int64
>> ----
>> a: [[1,2]]
```