[GitHub] [iceberg] rdblue commented on a diff in pull request #6965: Spark 3.3: Support write to branch through table identifier

via GitHub Sun, 05 Mar 2023 16:59:00 -0800


rdblue commented on code in PR #6965:
URL: https://github.com/apache/iceberg/pull/6965#discussion_r1125772423



##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMerge.java:
##########
@@ -214,11 +220,14 @@ public void 
testMergeIntoEmptyTargetInsertAllNonMatchingRows() {
             row(3, "emp-id-3") // new
             );
     assertEquals(
-        "Should have expected rows", expectedRows, sql("SELECT * FROM %s ORDER 
BY id", tableName));
+        "Should have expected rows",
+        expectedRows,
+        sql("SELECT * FROM %s ORDER BY id", selectTarget()));
   }
 
   @Test
   public void testMergeIntoEmptyTargetInsertOnlyMatchingRows() {
+    Assume.assumeTrue("Branch can only be used for non-empty table", branch == 
null);

Review Comment:
   Looking into these test assumptions made me think more about what should be 
valid.
   
   Currently, all of the test cases use a branch for all reads and writes, so 
there isn't anything testing implicit branching from the current table state. 
That is when writing, we will automatically create a branch from the current 
table state if it does not already exist. That was intended for convenience, so 
you can set an environment branch and auto-create on write.
   
   For commands like MERGE that use the current table state, I'm wondering if 
that makes any sense because the command needs to create a read for the 
branch's state, which doesn't yet exist.
   
   I'm leaning toward not allowing that for MERGE, UPDATE, or DELETE because it 
is so strange. Does that mean we shouldn't allow implicit branch creation for 
INSERT either? On one hand, that's what these tests rely on to write from the 
start into a branch. But on the other, there could be an expectation that 
writing to a branch that doesn't exist actually creates it from an empty state.
   
   So...
   1. I think we need tests for all of the operations that validate what 
happens when (a) the table is empty and an operation uses a branch that doesn't 
exist and (b) when the table is not empty and an operation uses a branch that 
doesn't exist
   2. I think that MERGE, UPDATE, and DELETE should fail right away with a 
"Branch does not exist: %s" error (currently this is an NPE in job planning)
   3. I think INSERT behavior should be clearly defined for branches, whether 
it is based on the current table state or an empty table state.
   
   @aokolnychyi, @jackye1995, @danielcweeks, @amogh-jahagirdar, @namrathamyske, 
what do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] rdblue commented on a diff in pull request #6965: Spark 3.3: Support write to branch through table identifier

Reply via email to