Re: [PR] infra: use spark base image for docker [iceberg-python]

via GitHub Sat, 18 Oct 2025 12:49:44 -0700


Fokko commented on code in PR #2540:
URL: https://github.com/apache/iceberg-python/pull/2540#discussion_r2386230181



##########
dev/provision.py:
##########
@@ -23,35 +22,27 @@
 from pyiceberg.schema import Schema
 from pyiceberg.types import FixedType, NestedField, UUIDType
 
-# The configuration is important, otherwise we get many small
-# parquet files with a single row. When a positional delete
-# hits the Parquet file with one row, the parquet file gets
-# dropped instead of having a merge-on-read delete file.
-spark = (
-    SparkSession
-        .builder
-        .config("spark.sql.shuffle.partitions", "1")
-        .config("spark.default.parallelism", "1")

Review Comment:
   I think we set these to avoid creating many files with just one row each. 
With both set to 1, a shuffle coalesces the output into a single file. 
This affects tests such as the positional-delete ones: when every row in a 
data file is marked for deletion, the whole file is dropped instead of 
a merge-on-read delete file being written.
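
To illustrate the reasoning, here is a rough toy model in plain Python. It is 
not Spark or Iceberg code: `DataFile`, `write_rows` and `delete_rows` are 
hypothetical names made up for this sketch, and the round-robin split only 
loosely stands in for a shuffle that writes one file per partition. It just 
shows why one-row files never produce a positional delete.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    """Toy stand-in for a Parquet data file: just a list of row values."""
    rows: list


def write_rows(rows, num_partitions):
    """Spread rows round-robin over `num_partitions` files (hypothetical helper).

    Loosely mimics a shuffle that writes one file per non-empty partition.
    """
    buckets = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        buckets[i % num_partitions].append(row)
    return [DataFile(b) for b in buckets if b]


def delete_rows(files, to_delete):
    """Apply a delete and report what happens to each file.

    Returns (kept, dropped, positional_deletes), where positional_deletes
    is a list of (file_index, row_position) pairs.
    """
    kept, dropped, positional_deletes = [], [], []
    for idx, f in enumerate(files):
        positions = [p for p, row in enumerate(f.rows) if row in to_delete]
        if not positions:
            kept.append(f)  # file untouched by the delete
        elif len(positions) == len(f.rows):
            dropped.append(f)  # every row deleted: drop the whole file
        else:
            kept.append(f)  # partially deleted: record positional deletes
            positional_deletes.extend((idx, p) for p in positions)
    return kept, dropped, positional_deletes


rows = [1, 2, 3, 4]

# Four partitions: one row per file, so the delete drops a whole file.
many_files = write_rows(rows, num_partitions=4)
_, dropped_many, pos_many = delete_rows(many_files, {2})

# One partition: a single file, so the delete becomes a positional delete.
one_file = write_rows(rows, num_partitions=1)
_, dropped_one, pos_one = delete_rows(one_file, {2})
```

In this model the four-partition case drops a file and records no positional 
deletes, while the single-partition case keeps the file and records one, 
which is the path the merge-on-read tests need to exercise.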



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


