[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6965: Spark 3.3: Support write to branch through table identifier

via GitHub Mon, 06 Mar 2023 08:13:56 -0800


jackye1995 commented on code in PR #6965:
URL: https://github.com/apache/iceberg/pull/6965#discussion_r1126688745



##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMerge.java:
##########
@@ -214,11 +220,14 @@ public void 
testMergeIntoEmptyTargetInsertAllNonMatchingRows() {
             row(3, "emp-id-3") // new
             );
     assertEquals(
-        "Should have expected rows", expectedRows, sql("SELECT * FROM %s ORDER 
BY id", tableName));
+        "Should have expected rows",
+        expectedRows,
+        sql("SELECT * FROM %s ORDER BY id", selectTarget()));
   }
 
   @Test
   public void testMergeIntoEmptyTargetInsertOnlyMatchingRows() {
+    Assume.assumeTrue("Branch can only be used for non-empty table", branch == 
null);

Review Comment:
   Sorry had some errands and was not able to push everything during the 
weekend. 
   
   Yes this is a point I also want to discuss. I used `Assume` to basically 
highlight this user experience explicit, glad we are having the same attentions 
to this issue here. I left the NPE there because this requires further 
discussion anyway. I can fix once we settle.
   
   > when writing, we will automatically create a branch from the current table 
state if it does not already exist. That was intended for convenience, so you 
can set an environment branch and auto-create on write.
   
   > I'm leaning toward not allowing that for MERGE, UPDATE, or DELETE because 
it is so strange
   
   I also agree it's strange to create a new branch for operations like UPDATE 
and DELETE. However, if I think from the perspective of branching being an 
improved version of WAP, then it feels less strange because we are just saying 
the result of a write is staged in the specific branch until published.
   
   The difference is that, in the WAP case, it won't explicitly say `DELETE 
FROM table.branch_xxx` and the branch is supposed to be hidden as an 
environment setting. So feels like this goes back to the argument about using 
identifier vs using an environment branch. If we consider them to be 
equivalent, then my logic would be: **because it makes sense to auto-create the 
branch for the WAP case, we need to match the experience for the identifier 
case even though it feels a bit awkward from SQL language perspective.**
   
   This means we also support cases like `testMergeIntoEmptyTarget` for all 
branches. And extended further, following this logic, we will probably have to 
define **a non-existing branch to always be at the current table state**. So if 
user does `SELECT * from table.branch_not_exist` it will just read the current 
table instead of saying branch not found.
   
   I think what Ryan is saying is following the opposite side of the logic that 
the WAP case and identifier case are not equivalent. My only concern would be 
it that would reduce a lot of the usability of branch, especially in the 
current situation that we consider the environment setting based approach to be 
difficult to achieve. At the same time, I am not sure if we really gain more 
consistency and clarity in this approach because there are still a lot of rules 
people need to understand and they might just get equally confused.
   
   If we are saying special identifiers like `table.branch_`, 
`table.snapshot_id_`, etc. are hacks to work around query engine limitations, 
then it feels to me that we can accept some awkwardness to make it work for 
more use cases.
   
   Would really like to know what others think.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6965: Spark 3.3: Support write to branch through table identifier

Reply via email to