amogh-jahagirdar commented on code in PR #10133:
URL: https://github.com/apache/iceberg/pull/10133#discussion_r1564298758

##########
spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java:
##########
@@ -948,6 +948,28 @@ public void testAddFilesWithParallelism() {
         sql("SELECT * FROM %s ORDER BY id", tableName));
   }
 
+  @TestTemplate
+  public void addFilesTargetTableEvolvedPartitioning() {

Review Comment:
   I think there are some fundamental changes we'll need to make around how the 
schema is derived from the Spark table to get this to work as expected. While 
the fix addresses the particular case in the issue, here's a case that will 
still not behave as expected:
   
   ```
       createIcebergTable("dept String, subdept String, id int, name String", "PARTITIONED BY (dept)");
   
       sql("ALTER TABLE %s ADD PARTITION FIELD subdept", tableName);
   
       String createParquet =
           "CREATE TABLE %s (dept String, subdept String, id int, name String) USING %s"
               + " PARTITIONED BY (dept, subdept) LOCATION '%s'";
   
       sql(createParquet, sourceTableName, "parquet", fileTableDir.getAbsolutePath());
       sql("INSERT INTO %s PARTITION (dept='hr', subdept='communications') VALUES (1, 'John Doe')", sourceTableName);
       sql("INSERT INTO %s PARTITION (dept='hr', subdept='salary') VALUES (2, 'Jane Doe')", sourceTableName);
       sql("INSERT INTO %s PARTITION (dept='hr', subdept='communications') VALUES (3, 'Matt Doe')", sourceTableName);
       sql("INSERT INTO %s PARTITION (dept='facilities', subdept='all') VALUES (4, 'Will Doe')", sourceTableName);
   
       sql("CALL %s.system.add_files('%s', '%s')", catalogName, tableName, sourceTableName);
   
       assertEquals(
           "Iceberg table contains correct data",
           sql("SELECT id, name, dept, subdept FROM %s ORDER BY id", sourceTableName),
           sql("SELECT id, name, dept, subdept FROM %s ORDER BY id", tableName));
   ```
   
   This case will still fail with the change because we fall back to the 
derived spec; we fall back because the field IDs in the derived spec differ 
from those in the target table. When the schema is derived from the Spark 
table, field IDs are assigned sequentially starting from 0, so they end up in 
a different order than the field IDs in the target table's spec.
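   To illustrate the mismatch, here is a minimal, self-contained sketch. This is 
not Iceberg's actual `Schema`/`PartitionSpec` API; the class, method names, and 
the target table's concrete ID values below are hypothetical stand-ins chosen 
only to show why a spec built over freshly assigned 0-based IDs cannot match a 
spec whose IDs were fixed when the target table was created and evolved:
   
   ```java
   import java.util.LinkedHashMap;
   import java.util.List;
   import java.util.Map;
   
   // Illustrative sketch only, not Iceberg's real classes: models how a schema
   // derived from a Spark table gets fresh field IDs in column order, while the
   // target Iceberg table keeps the IDs assigned at creation time.
   public class FieldIdMismatch {
     // Derived schemas assign fresh IDs sequentially, starting from 0.
     static Map<String, Integer> deriveIds(List<String> columns) {
       Map<String, Integer> ids = new LinkedHashMap<>();
       int nextId = 0;
       for (String column : columns) {
         ids.put(column, nextId++);
       }
       return ids;
     }
   
     public static void main(String[] args) {
       // Target table: created PARTITIONED BY (dept), then subdept added via
       // ALTER TABLE. Its field IDs (hypothetical values) were fixed at creation.
       Map<String, Integer> targetIds = new LinkedHashMap<>();
       targetIds.put("dept", 1);
       targetIds.put("subdept", 2);
       targetIds.put("id", 3);
       targetIds.put("name", 4);
   
       // Schema derived from the Spark source table: same column names, but
       // fresh 0-based IDs in a different numbering.
       Map<String, Integer> derivedIds =
           deriveIds(List.of("dept", "subdept", "id", "name"));
   
       // The IDs a derived partition spec would reference do not line up with
       // the target spec's IDs, so comparing the specs by field ID fails.
       System.out.println(targetIds.get("dept").equals(derivedIds.get("dept")));  // false
       System.out.println(derivedIds);
     }
   }
   ```
   
   Comparing specs by column name rather than by freshly assigned field ID (or 
remapping the derived IDs onto the target schema first) would avoid this class 
of false mismatch.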



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
