Fokko commented on code in PR #1879:
URL: https://github.com/apache/iceberg-python/pull/1879#discussion_r2044268567
##########
tests/integration/test_deletes.py:
##########
@@ -467,21 +467,19 @@ def test_partitioned_table_positional_deletes_sequence_number(spark: SparkSessio
assert snapshots[2].summary == Summary(
Operation.OVERWRITE,
Review Comment:
When I change it to CoW, I get the following summary for snapshot 1 (the
delete performed by Spark):
```json
{
"spark.app.id": "local-1744714815877",
"added-data-files": "1",
"deleted-data-files": "1",
"added-records": "1",
"deleted-records": "2",
"added-files-size": "714",
"removed-files-size": "743",
"changed-partition-count": "1",
"total-records": "4",
"total-files-size": "1461",
"total-data-files": "2",
"total-delete-files": "0",
"total-position-deletes": "0",
"total-equality-deletes": "0",
"engine-version": "3.5.1",
"app-id": "local-1744714815877",
"engine-name": "spark",
"iceberg-version": "Apache Iceberg 1.8.0 (commit c277c2014a1b37fe755cfe37f173b6465bb8cb73)"
}
```
Which seems correct:
```
(10, 100),
(10, 101), <- Deleted by Spark
(20, 200),
(20, 201),
(20, 202)
```
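The counts can be cross-checked with a little arithmetic. The sketch below is only an illustration: the summary values are copied from the JSON above, while the starting state (five rows in two data files, one per partition) is an assumption inferred from the row listing, not something the summary itself states.

```python
# Values copied from the Spark snapshot summary above.
# Iceberg stores every summary value as a string.
summary = {
    "added-data-files": "1",
    "deleted-data-files": "1",
    "added-records": "1",
    "deleted-records": "2",
    "total-records": "4",
    "total-data-files": "2",
    "total-delete-files": "0",
}

# Assumption: state of the table before the delete, inferred from the rows
# listed above (two data files, one per partition value 10 and 20).
rows_before = [(10, 100), (10, 101), (20, 200), (20, 201), (20, 202)]
files_before = 2


def net_change(summary: dict, added_key: str, deleted_key: str) -> int:
    """Return added minus deleted for a pair of summary counters."""
    return int(summary[added_key]) - int(summary[deleted_key])


# CoW drops the whole file for partition 10 (2 records) and writes a
# replacement file holding the single surviving record.
expected_records = len(rows_before) + net_change(
    summary, "added-records", "deleted-records"
)
expected_files = files_before + net_change(
    summary, "added-data-files", "deleted-data-files"
)

assert expected_records == int(summary["total-records"])
assert expected_files == int(summary["total-data-files"])
# Copy-on-write never produces delete files.
assert summary["total-delete-files"] == "0"
```

Under these assumptions the totals line up: 5 - 2 + 1 = 4 records and 2 - 1 + 1 = 2 data files.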
PyIceberg takes a different approach: the operation is an `Overwrite` that
first creates a snapshot rewriting the original data file, and then appends a
new file containing the updated record.
To reproduce this, I just removed the `TBLPROPERTIES` to set MoR.
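For anyone reproducing this, the difference between the two setups could be sketched as follows. The property names are the standard Iceberg write-mode table properties. The helper `create_table_sql`, the table identifier, and the column names are hypothetical and are not taken from the test file.

```python
# Iceberg table properties that select merge-on-read. When they are absent,
# Iceberg falls back to its copy-on-write default.
MOR_PROPERTIES = {
    "format-version": "2",
    "write.delete.mode": "merge-on-read",
    "write.update.mode": "merge-on-read",
    "write.merge.mode": "merge-on-read",
}


def create_table_sql(identifier: str, merge_on_read: bool) -> str:
    """Build a Spark SQL DDL string (hypothetical helper and schema)."""
    ddl = (
        f"CREATE TABLE {identifier} (number_partitioned int, number int) "
        "USING iceberg PARTITIONED BY (number_partitioned)"
    )
    if merge_on_read:
        props = ", ".join(f"'{k}'='{v}'" for k, v in MOR_PROPERTIES.items())
        ddl += f" TBLPROPERTIES ({props})"
    return ddl


# MoR variant, and the CoW variant with the TBLPROPERTIES clause removed.
mor_ddl = create_table_sql("default.example_table", merge_on_read=True)
cow_ddl = create_table_sql("default.example_table", merge_on_read=False)
```

The resulting strings would be passed to `spark.sql(...)` in a test; only the presence of the `TBLPROPERTIES` clause differs between the two.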
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]