Re: [PR] Spark: Add schema conversion support for default values [iceberg]

via GitHub Mon, 27 Oct 2025 22:35:25 -0700


amogh-jahagirdar commented on code in PR #14407:
URL: https://github.com/apache/iceberg/pull/14407#discussion_r2467923334



##########
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/TypeToSparkType.java:
##########
@@ -69,9 +70,24 @@ public DataType struct(Types.StructType struct, List<DataType> fieldResults) {
       if (field.doc() != null) {
         sparkField = sparkField.withComment(field.doc());
       }
+      if (field.writeDefault() != null) {
+        // Convert Iceberg default value to Spark SQL string representation. Spark stores default
+        // values as SQL strings in column metadata. The .sql() method formats literals correctly
+        // for each type
+        Object writeDefault = SparkUtil.internalToSpark(field.type(), field.writeDefault());
+        sparkField =
+            sparkField.withCurrentDefaultValue(Literal$.MODULE$.create(writeDefault, type).sql());
+      }
+      if (field.initialDefault() != null) {
+        // Same conversion for existence default values, used for existing rows when column is added
+        // to schema
+        Object initialDefault = SparkUtil.internalToSpark(field.type(), field.initialDefault());
+        sparkField =
+            sparkField.withExistenceDefaultValue(
+                Literal$.MODULE$.create(initialDefault, type).sql());
+      }
       sparkFields.add(sparkField);
     }
-

Review Comment:
   Can we undo this unnecessary line change? I also think we should double-check
   the style above: we generally put a newline after an if block for readability.



##########
spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/sql/TestSparkDefaultValues.java:
##########
@@ -0,0 +1,200 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.sql;
+
+import static org.assertj.core.api.Assertions.assertThat;
+import static org.assertj.core.api.Assertions.assertThatThrownBy;
+
+import org.apache.iceberg.ParameterizedTestExtension;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.expressions.Literal;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.CatalogTestBase;
+import org.apache.iceberg.types.Types;
+import org.apache.spark.sql.AnalysisException;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.TestTemplate;
+import org.junit.jupiter.api.extension.ExtendWith;
+
+/**
+ * Tests for Spark SQL Default values integration with Iceberg default values.
+ *
+ * <p>Note: These tests use {@code validationCatalog.createTable()} to create tables with default
+ * values because the Iceberg Spark integration does not yet support default value clauses in Spark
+ * DDL. See {@link #testCreateTableWithDefaultsNotYetSupported()} and {@link
+ * #testAlterTableAddColumnWithDefaultNotYetSupported()} for verification that DDL with defaults
+ * currently throws exceptions.
+ *
+ * <p>Partial column INSERT statements (e.g., {@code INSERT INTO table (col1) VALUES (val1)}) are
+ * not supported for DSV2 in Spark 4.0 See {@link #testPartialInsertNotYetSupportedInSpark()} for

Review Comment:
   I would remove "See <test> for verification"; it isn't a very helpful
   comment, IMO.



##########
spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/TestSparkSchemaUtil.java:
##########
@@ -21,17 +21,32 @@
 import static org.apache.iceberg.types.Types.NestedField.optional;
 import static org.assertj.core.api.Assertions.assertThat;
 
+import java.math.BigDecimal;
+import java.nio.ByteBuffer;
 import java.util.List;
+import java.util.TimeZone;
+import java.util.stream.Stream;
 import org.apache.iceberg.MetadataColumns;
 import org.apache.iceberg.Schema;
+import org.apache.iceberg.expressions.Literal;
+import org.apache.iceberg.types.Type;
 import org.apache.iceberg.types.Types;
 import org.apache.spark.sql.catalyst.expressions.AttributeReference;
 import org.apache.spark.sql.catalyst.expressions.MetadataAttribute;
 import org.apache.spark.sql.catalyst.types.DataTypeUtils;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
 import org.apache.spark.sql.types.StructType;
 import org.junit.jupiter.api.Test;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
 
 public class TestSparkSchemaUtil {
+
+  private static final String CURRENT_DEFAULT_COLUMN_METADATA_KEY = "CURRENT_DEFAULT";
+  private static final String EXISTS_DEFAULT_COLUMN_METADATA_KEY = "EXISTS_DEFAULT";

Review Comment:
   Can we access these directly from `ResolveDefaultColumnsUtils` instead of
   redefining them here?
   https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtils.scala#L26



##########
spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/sql/TestSparkDefaultValues.java:
##########
@@ -0,0 +1,200 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.sql;
+
+import static org.assertj.core.api.Assertions.assertThat;
+import static org.assertj.core.api.Assertions.assertThatThrownBy;
+
+import org.apache.iceberg.ParameterizedTestExtension;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.expressions.Literal;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.CatalogTestBase;
+import org.apache.iceberg.types.Types;
+import org.apache.spark.sql.AnalysisException;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.TestTemplate;
+import org.junit.jupiter.api.extension.ExtendWith;
+
+/**
+ * Tests for Spark SQL Default values integration with Iceberg default values.
+ *
+ * <p>Note: These tests use {@code validationCatalog.createTable()} to create tables with default
+ * values because the Iceberg Spark integration does not yet support default value clauses in Spark
+ * DDL. See {@link #testCreateTableWithDefaultsNotYetSupported()} and {@link
+ * #testAlterTableAddColumnWithDefaultNotYetSupported()} for verification that DDL with defaults
+ * currently throws exceptions.
+ *
+ * <p>Partial column INSERT statements (e.g., {@code INSERT INTO table (col1) VALUES (val1)}) are
+ * not supported for DSV2 in Spark 4.0 See {@link #testPartialInsertNotYetSupportedInSpark()} for
+ * verification.
+ */
+@ExtendWith(ParameterizedTestExtension.class)
+public class TestSparkDefaultValues extends CatalogTestBase {
+
+  @AfterEach
+  public void dropTestTable() {
+    sql("DROP TABLE IF EXISTS %s", tableName);
+  }
+
+  @TestTemplate
+  public void testWriteDefaultWithExplicitDefault() {
+    assertThat(validationCatalog.tableExists(tableIdent))
+        .as("Table should not already exist")
+        .isFalse();
+
+    Schema schema =
+        new Schema(
+            Types.NestedField.required(1, "id", Types.IntegerType.get()),
+            Types.NestedField.optional("data")
+                .withId(2)
+                .ofType(Types.StringType.get())
+                .withWriteDefault(Literal.of("default-data"))
+                .build());
+
+    validationCatalog.createTable(
+        tableIdent, schema, PartitionSpec.unpartitioned(), ImmutableMap.of("format-version", "3"));
+
+    sql("INSERT INTO %s VALUES (1, DEFAULT)", commitTarget());
+
+    assertEquals(
+        "Should insert row with default values",
+        ImmutableList.of(row(1, "default-data")),
+        sql("SELECT * FROM %s", selectTarget()));
+  }
+
+  @TestTemplate
+  public void testWriteDefaultForMultipleColumns() {
+    assertThat(validationCatalog.tableExists(tableIdent))
+        .as("Table should not already exist")
+        .isFalse();
+
+    Schema schema =
+        new Schema(
+            Types.NestedField.required(1, "id", Types.IntegerType.get()),
+            Types.NestedField.optional("bool_col")
+                .withId(2)
+                .ofType(Types.BooleanType.get())
+                .withWriteDefault(Literal.of(true))
+                .build(),
+            Types.NestedField.optional("int_col")
+                .withId(3)
+                .ofType(Types.IntegerType.get())
+                .withWriteDefault(Literal.of(42))
+                .build(),
+            Types.NestedField.optional("long_col")
+                .withId(4)
+                .ofType(Types.LongType.get())
+                .withWriteDefault(Literal.of(100L))
+                .build());
+
+    validationCatalog.createTable(
+        tableIdent, schema, PartitionSpec.unpartitioned(), ImmutableMap.of("format-version", "3"));
+
+    sql("INSERT INTO %s VALUES (1, DEFAULT, DEFAULT, DEFAULT)", commitTarget());
+
+    assertEquals(
+        "Should have expected default values",
+        ImmutableList.of(row(1, true, 42, 100L)),
+        sql("SELECT * FROM %s", selectTarget()));
+  }
+
+  @TestTemplate
+  public void testBulkInsertWithDefaults() {
+    assertThat(validationCatalog.tableExists(tableIdent))
+        .as("Table should not already exist")
+        .isFalse();
+
+    Schema schema =
+        new Schema(
+            Types.NestedField.required(1, "id", Types.IntegerType.get()),
+            Types.NestedField.optional("data")
+                .withId(2)
+                .ofType(Types.StringType.get())
+                .withWriteDefault(Literal.of("default_data"))
+                .build());
+
+    validationCatalog.createTable(
+        tableIdent, schema, PartitionSpec.unpartitioned(), ImmutableMap.of("format-version", "3"));
+
+    sql("INSERT INTO %s VALUES (1, DEFAULT), (2, DEFAULT), (3, DEFAULT)", commitTarget());
+
+    assertEquals(
+        "Should insert multiple rows with default values",
+        ImmutableList.of(row(1, "default_data"), row(2, "default_data"), row(3, "default_data")),
+        sql("SELECT * FROM %s ORDER BY id", selectTarget()));
+  }
+
+  @TestTemplate
+  public void testCreateTableWithDefaultsNotYetSupported() {
+    assertThat(validationCatalog.tableExists(tableIdent))
+        .as("Table should not already exist")
+        .isFalse();
+
+    assertThatThrownBy(
+            () ->
+                sql(
+                    "CREATE TABLE %s (id INT, data STRING DEFAULT 'default-value') USING iceberg",
+                    tableName))
+        .isInstanceOf(AnalysisException.class)
+        .hasMessageContaining("does not support column default value");
+  }
+
+  @TestTemplate
+  public void testAlterTableAddColumnWithDefaultNotYetSupported() {
+    assertThat(validationCatalog.tableExists(tableIdent))
+        .as("Table should not already exist")
+        .isFalse();
+
+    Schema schema = new Schema(Types.NestedField.required(1, "id", Types.IntegerType.get()));
+
+    validationCatalog.createTable(
+        tableIdent, schema, PartitionSpec.unpartitioned(), ImmutableMap.of("format-version", "3"));
+
+    assertThatThrownBy(
+            () -> sql("ALTER TABLE %s ADD COLUMN data STRING DEFAULT 'default-value'", tableName))
+        .isInstanceOf(UnsupportedOperationException.class)
+        .hasMessageContaining("default values in Spark is currently unsupported");
+  }
+
+  @TestTemplate
+  public void testPartialInsertNotYetSupportedInSpark() {

Review Comment:
   This won't be supported until 4.1, so I don't think "yet" even applies in 
this 4.0 module integration. I would just call this `testPartialInsertFails` or 
`testPartialInsertUnsupported`



##########
spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/sql/TestSparkDefaultValues.java:
##########
@@ -0,0 +1,200 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.sql;
+
+import static org.assertj.core.api.Assertions.assertThat;
+import static org.assertj.core.api.Assertions.assertThatThrownBy;
+
+import org.apache.iceberg.ParameterizedTestExtension;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.expressions.Literal;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.spark.CatalogTestBase;
+import org.apache.iceberg.types.Types;
+import org.apache.spark.sql.AnalysisException;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.TestTemplate;
+import org.junit.jupiter.api.extension.ExtendWith;
+
+/**
+ * Tests for Spark SQL Default values integration with Iceberg default values.
+ *
+ * <p>Note: These tests use {@code validationCatalog.createTable()} to create tables with default
+ * values because the Iceberg Spark integration does not yet support default value clauses in Spark
+ * DDL. See {@link #testCreateTableWithDefaultsNotYetSupported()} and {@link
+ * #testAlterTableAddColumnWithDefaultNotYetSupported()} for verification that DDL with defaults
+ * currently throws exceptions.
+ *
+ * <p>Partial column INSERT statements (e.g., {@code INSERT INTO table (col1) VALUES (val1)}) are
+ * not supported for DSV2 in Spark 4.0 See {@link #testPartialInsertNotYetSupportedInSpark()} for
+ * verification.
+ */
+@ExtendWith(ParameterizedTestExtension.class)
+public class TestSparkDefaultValues extends CatalogTestBase {
+
+  @AfterEach
+  public void dropTestTable() {
+    sql("DROP TABLE IF EXISTS %s", tableName);
+  }
+
+  @TestTemplate
+  public void testWriteDefaultWithExplicitDefault() {
+    assertThat(validationCatalog.tableExists(tableIdent))
+        .as("Table should not already exist")
+        .isFalse();
+
+    Schema schema =
+        new Schema(
+            Types.NestedField.required(1, "id", Types.IntegerType.get()),
+            Types.NestedField.optional("data")
+                .withId(2)
+                .ofType(Types.StringType.get())
+                .withWriteDefault(Literal.of("default-data"))
+                .build());
+
+    validationCatalog.createTable(
+        tableIdent, schema, PartitionSpec.unpartitioned(), ImmutableMap.of("format-version", "3"));
+
+    sql("INSERT INTO %s VALUES (1, DEFAULT)", commitTarget());
+
+    assertEquals(
+        "Should insert row with default values",
+        ImmutableList.of(row(1, "default-data")),
+        sql("SELECT * FROM %s", selectTarget()));
+  }
+
+  @TestTemplate
+  public void testWriteDefaultForMultipleColumns() {
+    assertThat(validationCatalog.tableExists(tableIdent))
+        .as("Table should not already exist")
+        .isFalse();
+
+    Schema schema =
+        new Schema(
+            Types.NestedField.required(1, "id", Types.IntegerType.get()),
+            Types.NestedField.optional("bool_col")
+                .withId(2)
+                .ofType(Types.BooleanType.get())
+                .withWriteDefault(Literal.of(true))
+                .build(),
+            Types.NestedField.optional("int_col")
+                .withId(3)
+                .ofType(Types.IntegerType.get())
+                .withWriteDefault(Literal.of(42))
+                .build(),
+            Types.NestedField.optional("long_col")
+                .withId(4)
+                .ofType(Types.LongType.get())
+                .withWriteDefault(Literal.of(100L))
+                .build());
+
+    validationCatalog.createTable(
+        tableIdent, schema, PartitionSpec.unpartitioned(), ImmutableMap.of("format-version", "3"));
+
+    sql("INSERT INTO %s VALUES (1, DEFAULT, DEFAULT, DEFAULT)", commitTarget());
+
+    assertEquals(
+        "Should have expected default values",
+        ImmutableList.of(row(1, true, 42, 100L)),
+        sql("SELECT * FROM %s", selectTarget()));
+  }
+
+  @TestTemplate
+  public void testBulkInsertWithDefaults() {
+    assertThat(validationCatalog.tableExists(tableIdent))
+        .as("Table should not already exist")
+        .isFalse();
+
+    Schema schema =
+        new Schema(
+            Types.NestedField.required(1, "id", Types.IntegerType.get()),
+            Types.NestedField.optional("data")
+                .withId(2)
+                .ofType(Types.StringType.get())
+                .withWriteDefault(Literal.of("default_data"))
+                .build());
+
+    validationCatalog.createTable(
+        tableIdent, schema, PartitionSpec.unpartitioned(), ImmutableMap.of("format-version", "3"));
+
+    sql("INSERT INTO %s VALUES (1, DEFAULT), (2, DEFAULT), (3, DEFAULT)", commitTarget());
+
+    assertEquals(
+        "Should insert multiple rows with default values",
+        ImmutableList.of(row(1, "default_data"), row(2, "default_data"), row(3, "default_data")),
+        sql("SELECT * FROM %s ORDER BY id", selectTarget()));
+  }
+
+  @TestTemplate
+  public void testCreateTableWithDefaultsNotYetSupported() {
+    assertThat(validationCatalog.tableExists(tableIdent))
+        .as("Table should not already exist")
+        .isFalse();
+
+    assertThatThrownBy(
+            () ->
+                sql(
+                    "CREATE TABLE %s (id INT, data STRING DEFAULT 'default-value') USING iceberg",
+                    tableName))
+        .isInstanceOf(AnalysisException.class)
+        .hasMessageContaining("does not support column default value");
+  }
+
+  @TestTemplate
+  public void testAlterTableAddColumnWithDefaultNotYetSupported() {

Review Comment:
   Alter table isn't supported yet but we may want to add a test that does a 
schema evolution with some default value changes through the table APIs and 
verify the behavior on a subsequent write/read.
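
   The behavior such a test would pin down can be sketched with a toy model.
   This is a minimal, self-contained illustration and not the Iceberg or Spark
   API: `DefaultsModel`, `Table`, `addColumn`, `updateWriteDefault`, `insert`,
   and `read` are all hypothetical names. It assumes the semantics described in
   the Iceberg table spec: rows written before a column was added read back its
   initial default, while rows written afterwards without a value store
   whichever write default is current at write time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of initial-default vs. write-default; hypothetical, not the Iceberg API. */
final class DefaultsModel {

  /** A column with the two defaults the spec distinguishes. */
  static final class Column {
    final String name;
    final Object initialDefault; // fixed when the column is added
    Object writeDefault; // may be changed by later schema evolution

    Column(String name, Object initialDefault, Object writeDefault) {
      this.name = name;
      this.initialDefault = initialDefault;
      this.writeDefault = writeDefault;
    }
  }

  static final class Table {
    private final Map<String, Column> columns = new LinkedHashMap<>();
    private final List<Map<String, Object>> rows = new ArrayList<>();

    void addColumn(String name, Object initialDefault, Object writeDefault) {
      columns.put(name, new Column(name, initialDefault, writeDefault));
    }

    void updateWriteDefault(String name, Object newDefault) {
      columns.get(name).writeDefault = newDefault;
    }

    /** Stores supplied values; omitted columns get the write default current at write time. */
    void insert(Map<String, ?> supplied) {
      Map<String, Object> stored = new HashMap<>(supplied);
      for (Column column : columns.values()) {
        if (!stored.containsKey(column.name)) {
          stored.put(column.name, column.writeDefault);
        }
      }
      rows.add(stored);
    }

    /** Rows written before the column existed read back its initial default. */
    List<Object> read(String name) {
      Column column = columns.get(name);
      List<Object> values = new ArrayList<>();
      for (Map<String, Object> row : rows) {
        values.add(row.containsKey(name) ? row.get(name) : column.initialDefault);
      }
      return values;
    }
  }

  public static void main(String[] args) {
    Table table = new Table();
    table.addColumn("id", null, null);
    table.insert(Map.of("id", 1)); // written before "data" exists

    table.addColumn("data", "initial", "write-v1");
    table.insert(Map.of("id", 2)); // picks up write-v1

    table.updateWriteDefault("data", "write-v2");
    table.insert(Map.of("id", 3)); // picks up write-v2

    System.out.println(table.read("data")); // prints [initial, write-v1, write-v2]
  }
}
```

   A real test would presumably drive the same sequence through the table's
   schema-update API and Spark reads and writes; the model only shows which
   value each row is expected to carry.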


