amogh-jahagirdar commented on code in PR #6965:
URL: https://github.com/apache/iceberg/pull/6965#discussion_r1121979187
##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/Spark3Util.java:
##########
@@ -1017,4 +1019,41 @@ public String unknown(
      return String.format("%s(%s) %s %s", transform, sourceName, direction, nullOrder);
}
}
+
+  public static Map<String, String> buildWriteOptions(String branch, LogicalWriteInfo info) {
+ ImmutableMap.Builder<String, String> builder =
+ ImmutableMap.<String, String>builder().putAll(info.options());
+
+ if (branch != null) {
+ String optionBranch = info.options().get(SparkReadOptions.BRANCH);
+ if (optionBranch == null) {
+ builder.put(SparkWriteOptions.BRANCH, branch);
+ } else if (!optionBranch.equals(branch)) {
+ throw new ValidationException(
          "Cannot override branch to write more than once, received %s in identifier and %s in write options",
+ branch, optionBranch);
+ }
+ }
+
+ return builder.build();
+ }
+
+ public static CaseInsensitiveStringMap buildScanOptions(
+ String branch, CaseInsensitiveStringMap options) {
+ ImmutableMap.Builder<String, String> builder =
+ ImmutableMap.<String, String>builder().putAll(options);
+
+ if (branch != null) {
+ String optionBranch = options.get(SparkReadOptions.BRANCH);
+ if (optionBranch == null) {
+ builder.put(SparkReadOptions.BRANCH, branch);
+ } else if (!optionBranch.equals(branch)) {
+ throw new ValidationException(
            "Cannot override branch to read more than once, received %s in identifier and %s in scan options",
Review Comment:
Yeah, I think this is a point of discussion. My take was that we deprecate the
SparkReadOptions so that, from a user experience perspective, there aren't a
bunch of ways to query the head of a branch. Every operation (SQL reads and
writes, or DataFrame) has the branch in the identifier; the only addition for
SQL is the time travel syntax.
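To make the conflict rule in the hunk above concrete, here is a minimal, self-contained sketch of the same validation with plain collections: the branch from the identifier wins when no option is set, a matching option is a no-op, and a conflicting option fails fast. The class name, the `"branch"` key, and the use of `IllegalArgumentException` (in place of Iceberg's `ValidationException`) are illustrative stand-ins, not the actual Iceberg API.

```java
import java.util.HashMap;
import java.util.Map;

public class BranchOptionCheck {
  // Stand-in for SparkWriteOptions.BRANCH.
  static final String BRANCH_OPTION = "branch";

  // Mirrors the logic of Spark3Util.buildWriteOptions in the diff above.
  static Map<String, String> resolve(String identifierBranch, Map<String, String> options) {
    Map<String, String> resolved = new HashMap<>(options);
    if (identifierBranch != null) {
      String optionBranch = options.get(BRANCH_OPTION);
      if (optionBranch == null) {
        // No branch in the options: the identifier's branch is applied.
        resolved.put(BRANCH_OPTION, identifierBranch);
      } else if (!optionBranch.equals(identifierBranch)) {
        // Two different branches were supplied: reject the write.
        throw new IllegalArgumentException(String.format(
            "Cannot override branch to write more than once, received %s in identifier and %s in write options",
            identifierBranch, optionBranch));
      }
    }
    return resolved;
  }

  public static void main(String[] args) {
    Map<String, String> opts = new HashMap<>();
    System.out.println(resolve("audit", opts).get(BRANCH_OPTION)); // audit
    opts.put(BRANCH_OPTION, "other");
    try {
      resolve("audit", opts);
    } catch (IllegalArgumentException e) {
      System.out.println("conflict rejected");
    }
  }
}
```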
##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java:
##########
@@ -659,10 +659,7 @@ private Table load(Identifier ident) {
Matcher branch = BRANCH.matcher(ident.name());
if (branch.matches()) {
- Snapshot branchSnapshot = table.snapshot(branch.group(1));
- if (branchSnapshot != null) {
-      return new SparkTable(table, branchSnapshot.snapshotId(), !cacheEnabled);
- }
+ return new SparkTable(table, branch.group(1), !cacheEnabled);
Review Comment:
Do we want to keep the "branch_" prefix in the identifier? I just removed it
in the previous PR, so everything after the table part in the identifier is
treated as a branch.
https://github.com/apache/iceberg/pull/6651/files#diff-6028f80634c027dee9ea947b9f0e50a51b1e5ced1834b902f0f47303514d0e96L660
I'm okay either way; one benefit of the prefix is that, from a user experience
point of view, it's explicit that it's a branch and less likely to collide
with a column identifier. It's just a bit more verbose.
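For illustration, a prefixed identifier could be parsed roughly like the sketch below. The exact regex behind the `BRANCH` matcher in SparkCatalog is an assumption here; only the shape of the match-and-extract step is meant.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BranchIdentifier {
  // Assumed shape of the BRANCH pattern discussed above; the real pattern
  // in SparkCatalog may differ.
  static final Pattern BRANCH = Pattern.compile("branch_(.*)");

  // Returns the branch name if the identifier part carries the prefix,
  // otherwise null (the part could then be a column or something else).
  static String branchName(String identPart) {
    Matcher m = BRANCH.matcher(identPart);
    return m.matches() ? m.group(1) : null;
  }

  public static void main(String[] args) {
    System.out.println(branchName("branch_audit")); // audit
    System.out.println(branchName("audit"));        // null: no prefix
  }
}
```

The prefix makes the intent unambiguous at parse time, which is the collision-avoidance benefit mentioned above.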
##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java:
##########
@@ -252,12 +266,12 @@ public WriteBuilder newWriteBuilder(LogicalWriteInfo info) {
Preconditions.checkArgument(
        snapshotId == null, "Cannot write to table at a specific snapshot: %s", snapshotId);
- return new SparkWriteBuilder(sparkSession(), icebergTable, info);
+ return new SparkWriteBuilder(sparkSession(), icebergTable, branch, info);
}
@Override
public RowLevelOperationBuilder
newRowLevelOperationBuilder(RowLevelOperationInfo info) {
-    return new SparkRowLevelOperationBuilder(sparkSession(), icebergTable, info);
+    return new SparkRowLevelOperationBuilder(sparkSession(), icebergTable, branch, info);
Review Comment:
https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java#L290
SparkTable#canDeleteUsingMetadata needs to be updated to use the branch as
well during the scan
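The concern here can be shown with a toy model of ref-to-snapshot resolution: a metadata-only check that always resolves against the main branch would look at the wrong snapshot when the table was loaded for another branch. Names and types below are illustrative stand-ins, not Iceberg APIs.

```java
import java.util.HashMap;
import java.util.Map;

public class RefResolution {
  // Toy model: each named ref points at a snapshot id.
  static Long snapshotFor(Map<String, Long> refs, String branch) {
    // Fall back to "main" when no branch was given, mirroring default reads.
    return refs.get(branch != null ? branch : "main");
  }

  public static void main(String[] args) {
    Map<String, Long> refs = new HashMap<>();
    refs.put("main", 100L);
    refs.put("audit", 200L);
    // A branch-unaware scan resolves main's snapshot...
    System.out.println(snapshotFor(refs, null));    // 100
    // ...while the table was actually loaded for the audit branch.
    System.out.println(snapshotFor(refs, "audit")); // 200
  }
}
```

This is why any scan constructed inside canDeleteUsingMetadata should carry the same branch as the table it belongs to.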
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]