jackye1995 commented on code in PR #6965:
URL: https://github.com/apache/iceberg/pull/6965#discussion_r1126688745
##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMerge.java:
##########
@@ -214,11 +220,14 @@ public void
testMergeIntoEmptyTargetInsertAllNonMatchingRows() {
row(3, "emp-id-3") // new
);
assertEquals(
- "Should have expected rows", expectedRows, sql("SELECT * FROM %s ORDER
BY id", tableName));
+ "Should have expected rows",
+ expectedRows,
+ sql("SELECT * FROM %s ORDER BY id", selectTarget()));
}
@Test
public void testMergeIntoEmptyTargetInsertOnlyMatchingRows() {
+ Assume.assumeTrue("Branch can only be used for non-empty table", branch ==
null);
Review Comment:
Sorry had some errands and was not able to push everything during the
weekend.
Yes this is a point I also want to discuss. I used `Assume` to basically
highlight this user experience explicit, glad we are having the same attentions
to this issue here. I left the NPE there because this requires further
discussion anyway. I can fix once we settle.
> when writing, we will automatically create a branch from the current table
state if it does not already exist. That was intended for convenience, so you
can set an environment branch and auto-create on write.
> I'm leaning toward not allowing that for MERGE, UPDATE, or DELETE because
it is so strange
I also agree it's strange to create a new branch for operations like UPDATE
and DELETE. However, if I think from the perspective of branching being an
improved version of WAP, then it feels less strange because we are just saying
the result of a write is staged in the specific branch until published.
The difference is that, in the WAP case, it won't explicitly say `DELETE
FROM table.branch_xxx` and the branch is supposed to be hidden as an
environment setting. So feels like this goes back to the argument about using
identifier vs using an environment branch. If we consider them to be
equivalent, then my logic would be: **because it makes sense to auto-create the
branch for the WAP case, we need to match the experience for the identifier
case even though it feels a bit awkward from SQL language perspective.**
This means we also support cases like `testMergeIntoEmptyTarget` for all
branches. And extended further, following this logic, we will probably have to
define **a non-existing branch to always be at the current table state**. So if
user does `SELECT * from table.branch_not_exist` it will just read the current
table instead of saying branch not found.
I think what Ryan is saying is following the opposite side of the logic that
the WAP case and identifier case are not equivalent. My only concern would be
it that would reduce a lot of the usability of branch, especially in the
current situation that we consider the environment setting based approach to be
difficult to achieve. At the same time, I am not sure if we really gain more
consistency and clarity in this approach because there are still a lot of rules
people need to understand and they might just get equally confused.
If we are saying special identifiers like `table.branch_`,
`table.snapshot_id_`, etc. are hacks to work around query engine limitations,
then it feels to me that we can accept some awkwardness to make it work for
more use cases.
Would really like to know what others think.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]