Re: [PR] infra: use spark base image for docker [iceberg-python]

via GitHub Wed, 01 Oct 2025 10:12:40 -0700


kevinjqliu commented on code in PR #2540:
URL: https://github.com/apache/iceberg-python/pull/2540#discussion_r2395293508



##########
dev/provision.py:
##########
@@ -23,35 +22,27 @@
 from pyiceberg.schema import Schema
 from pyiceberg.types import FixedType, NestedField, UUIDType
 
-# The configuration is important, otherwise we get many small
-# parquet files with a single row. When a positional delete
-# hits the Parquet file with one row, the parquet file gets
-# dropped instead of having a merge-on-read delete file.
-spark = (
-    SparkSession
-        .builder
-        .config("spark.sql.shuffle.partitions", "1")
-        .config("spark.default.parallelism", "1")

Review Comment:
   Looks like Spark Connect doesn't support these options:
   ```
           .config("spark.sql.shuffle.partitions", "1")
           .config("spark.default.parallelism", "1")
   ```
   
   And `INSERT INTO` writes one data file per row. To force a single data file, I'm using:
   ```
   .coalesce(1).writeTo(identifier).append()
   ```
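
A minimal sketch of that workaround as a reusable helper, assuming `df` is a PySpark DataFrame and `identifier` names an existing table. The helper name `append_as_single_file` and the table name in the usage comment are hypothetical and not taken from the PR.

```python
def append_as_single_file(df, identifier):
    """Append `df` to the table `identifier` from a single partition."""
    # coalesce(1) collapses the DataFrame to one partition without a full
    # shuffle, so only one write task runs. For a small, unpartitioned
    # table this is expected to yield a single data file (assumption:
    # the data stays below the writer's target file size).
    df.coalesce(1).writeTo(identifier).append()


# Usage with an existing session (hypothetical table name):
#   df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
#   append_as_single_file(df, "default.example_table")
```

Since the session-level `spark.sql.shuffle.partitions` and `spark.default.parallelism` settings are not available here, the partition count is set per write on the DataFrame.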



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


