RussellSpitzer commented on code in PR #12931:
URL: https://github.com/apache/iceberg/pull/12931#discussion_r2067417258
##########
docs/docs/spark-configuration.md:
##########
@@ -145,6 +145,61 @@ Using those SQL commands requires adding Iceberg extensions to your Spark enviro
 
 ## Runtime configuration
 
+### Precedence of Configuration Settings
+
+Iceberg allows configurations to be specified at different levels. The effective configuration for a read or write operation is determined by the following order of precedence:
+
+1. DataSource API read/write options – passed explicitly to `.option(...)` in a read or write operation.
+
+2. Spark session configuration – set globally via `spark.conf.set(...)`, `spark-defaults.conf`, or `--conf` in `spark-submit`.
+
+3. Table properties – defined on the Iceberg table via `ALTER TABLE ... SET TBLPROPERTIES`.
+
+4. Default value – the built-in default used when nothing else is set.
+
+If a setting is not defined at a higher-precedence level, the next level is used as a fallback. This allows per-operation flexibility while still supporting global defaults.
+
+### Spark SQL Options
+
+Iceberg supports setting various global behaviors using Spark SQL configuration options. These can be set via `spark.conf.set(...)`, `SparkSession` builder settings, or `spark-submit` arguments.
+For example:
+
+```scala
+// disabling vectorization
+val spark = SparkSession.builder()
+  .appName("IcebergExample")
+  .master("local[*]")
+  .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
+  .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
+  .config("spark.sql.iceberg.vectorization.enabled", "false")
+  .getOrCreate()
+```
+
+| Spark option                                       | Default                                                        | Description                                                              |
+|----------------------------------------------------|----------------------------------------------------------------|--------------------------------------------------------------------------|
+| spark.sql.iceberg.vectorization.enabled            | Table default                                                  | Enables vectorized reads of data files                                   |
+| spark.sql.iceberg.parquet.reader-type              | ICEBERG                                                        | Sets the Parquet reader implementation (`ICEBERG`, `COMET`)              |
+| spark.sql.iceberg.check-nullability                | true                                                           | Whether to perform the nullability check during writes                   |
+| spark.sql.iceberg.check-ordering                   | true                                                           | Whether to check the order of fields during writes                       |
+| spark.sql.iceberg.planning.preserve-data-grouping  | false                                                          | Whether to preserve the existing grouping of data while planning splits  |
+| spark.sql.iceberg.aggregate-push-down.enabled      | true                                                           | Enables pushdown of aggregate functions (MAX, MIN, COUNT)                |
+| spark.sql.iceberg.distribution-mode                | See [Spark Writes](spark-writes.md#writing-distribution-modes) | Controls the distribution strategy during writes                         |
+| spark.wap.id                                       | null                                                           | [Write-Audit-Publish](branching.md#audit-branch) snapshot staging ID     |
+| spark.wap.branch                                   | null                                                           | WAP branch name for snapshot commit                                      |
+| spark.sql.iceberg.compression-codec                | Table default                                                  | Write compression codec (e.g., `zstd`, `snappy`)                         |
+| spark.sql.iceberg.compression-level                | Table default                                                  | Compression level for Parquet/Avro                                       |
+| spark.sql.iceberg.compression-strategy             | Table default                                                  | Compression strategy (for ORC)                                           |
+| spark.sql.iceberg.data-planning-mode               | Table default                                                  | Override for data planning mode                                          |
+| spark.sql.iceberg.delete-planning-mode             | Table default                                                  | Override for delete planning mode                                        |

Review Comment (on `spark.sql.iceberg.data-planning-mode`):
   Default is AUTO

Review Comment (on `spark.sql.iceberg.delete-planning-mode`):
   Default is AUTO
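To make the documented precedence concrete, here is a minimal sketch in the same spirit as the builder example in the patch. The table name `my_catalog.db.events` and the sample `df` are hypothetical, the `compression-codec` write option name is an assumption based on Iceberg's Spark write options, and catalog configuration is omitted for brevity.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical session; catalog config as in the builder example above.
val spark = SparkSession.builder().getOrCreate()
val df = spark.range(10).toDF("id")

// 3. Table property (lowest explicit level): set a codec on the table.
spark.sql(
  """ALTER TABLE my_catalog.db.events
    |SET TBLPROPERTIES ('write.parquet.compression-codec' = 'gzip')""".stripMargin)

// 2. Session configuration overrides the table property for this session.
spark.conf.set("spark.sql.iceberg.compression-codec", "snappy")

// 1. A per-operation write option overrides both: this append uses zstd.
df.writeTo("my_catalog.db.events")
  .option("compression-codec", "zstd")
  .append()
```

Removing the `.option(...)` call would make the write fall back to the session setting (`snappy`), and unsetting that would fall back to the table property (`gzip`).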
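Similarly, a sketch of the WAP session settings listed in the table, continuing with the same hypothetical `spark` and `df`: with `spark.wap.branch` set, Iceberg commits write snapshots to the named branch rather than to `main`.

```scala
// Stage this session's writes on an audit branch instead of main.
// The branch name is illustrative; see the branching docs linked above.
spark.conf.set("spark.wap.branch", "audit_2025_05")

// Commits a snapshot to branch 'audit_2025_05'; readers of the table's
// main branch do not see the new data until the branch is published.
df.writeTo("my_catalog.db.events").append()

// Clear the setting so subsequent writes commit to main again.
spark.conf.unset("spark.wap.branch")
```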