Re: [PR] Spark: Fix SparkTable to use name and effective snapshotID for comparing [iceberg]

via GitHub Fri, 02 Feb 2024 02:48:48 -0800


wooyeong commented on code in PR #9455:
URL: https://github.com/apache/iceberg/pull/9455#discussion_r1475886118



##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java:
##########
@@ -405,15 +407,18 @@ public boolean equals(Object other) {
       return false;
     }
 
-    // use only name in order to correctly invalidate Spark cache
+    // use name only unless branch/snapshotId is given in order to correctly 
invalidate Spark cache
+    // when branch or snapshotId is given, it's time travel
     SparkTable that = (SparkTable) other;
-    return icebergTable.name().equals(that.icebergTable.name());
+    return icebergTable.name().equals(that.icebergTable.name())
+        && Objects.equals(snapshotId, that.snapshotId);

Review Comment:
   I've organized my thoughts.
   
   Currently, we are working on an issue that arises when comparing two 
different `SparkTable` objects.
   
   A `SparkTable` can be initialized in three ways:
   - `main` branch (when neither `branch` nor `snapshotId` is provided)
   - Specific branch (when a `branch` is provided)
   - Specific snapshot (when a `snapshotId` is provided)
   
   When a snapshotId is provided, since the snapshotId is immutable, we can 
simply compare the two snapshots. In other cases, it's when a branch is 
provided. (`main` (where `branch` is null) or specific branch is given)
   
   If the branches of the two SparkTables are different, I believe it is 
correct to consider them as different objects. This is because if a branch is 
specified, the user has intentionally created distinct table objects, and that 
intention should be respected. Therefore, using the name of the branch to 
compare is reasonable.
   
   I've looked into the impact of CachingCatalog and refreshEagerly.
   
   There is no issue when using CachingCatalog. When a branch is used, we 
obtain the `main` branch SparkTable and then use the `copyWithBranch` method to 
create a separate SparkTable object. The same applies when a snapshotId is used.
   
   ```sql
   -- test sql
   SELECT * FROM iceberg_except_test
   VERSION AS OF '2023-01-01'
   UNION ALL
   SELECT * FROM iceberg_except_test;
   
   -- injected logs
   SparkCatalog.loadTable(ident, version): identifier: iceberg_except_test, 
ver: 2023-01-01
   SparkCatalog.loadTable(ident): identifier: iceberg_except_test
   SparkCatalog.load: identifier: iceberg_except_test
   CachingCatalog.load: identifier: iceberg_except_test
   -> SparkTable -> SparkTable#copyWithSnapshotId
   
   SparkCatalog.loadTable(ident): identifier: iceberg_except_test
   SparkCatalog.load: identifier: iceberg_except_test
   CachingCatalog.load: identifier: iceberg_except_test
   -> SparkTable
   ```
   
   Lastly, `refreshEagerly` is only valid when using Cache. If Cache is not 
used, `refreshEagerly` becomes false, and a refresh occurs. I don't think it's 
necessary to check whether cache is being used when verifying the equality of 
two SparkTables. This is because even in the case of simply performing a 
SELECT, the results may vary depending on the timing, which is an expected 
outcome. Furthermore, the current Iceberg code does not consider refreshEagerly 
when comparing two SparkTables, so it won't be worse than it currently is.
   
   
   Therefore, to summarize my argument:
   
   - It is sufficient to compare the name, branch, and snapshotId.
     - Remove the current implementation of extracting snapshotId from the 
branch in this PR.
   - When using CachingCatalog, there is no impact because a separate instance 
is created using the copyWithXX method.
   - We don't need to consider the impact of the refreshEagerly option
   
   I look forward to hearing your feedback. Thanks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Spark: Fix SparkTable to use name and effective snapshotID for comparing [iceberg]

Reply via email to