wooyeong commented on code in PR #9455: URL: https://github.com/apache/iceberg/pull/9455#discussion_r1475886118
########## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java: ########## @@ -405,15 +407,18 @@ public boolean equals(Object other) { return false; } - // use only name in order to correctly invalidate Spark cache + // use name only unless branch/snapshotId is given in order to correctly invalidate Spark cache + // when branch or snapshotId is given, it's time travel SparkTable that = (SparkTable) other; - return icebergTable.name().equals(that.icebergTable.name()); + return icebergTable.name().equals(that.icebergTable.name()) + && Objects.equals(snapshotId, that.snapshotId); Review Comment: I've organized my thoughts. Currently, we are working on an issue that arises when comparing two different `SparkTable` objects. A `SparkTable` can be initialized in three ways: - `main` branch (when neither `branch` nor `snapshotId` is provided) - Specific branch (when a `branch` is provided) - Specific snapshot (when a `snapshotId` is provided) When a snapshotId is provided, since the snapshotId is immutable, we can simply compare the two snapshots. In other cases, it's when a branch is provided. (`main` (where `branch` is null) or specific branch is given) If the branches of the two SparkTables are different, I believe it is correct to consider them as different objects. This is because if a branch is specified, the user has intentionally created distinct table objects, and that intention should be respected. Therefore, using the name of the branch to compare is reasonable. I've looked into the impact of CachingCatalog and refreshEagerly. There is no issue when using CachingCatalog. When a branch is used, we obtain the `main` branch SparkTable and then use the `copyWithBranch` method to create a separate SparkTable object. The same applies when a snapshotId is used. ```sql -- test sql SELECT * FROM iceberg_except_test VERSION AS OF '2023-01-01' UNION ALL SELECT * FROM iceberg_except_test; -- injected logs SparkCatalog.loadTable(ident, version): identifier: iceberg_except_test, ver: 2023-01-01 SparkCatalog.loadTable(ident): identifier: iceberg_except_test SparkCatalog.load: identifier: iceberg_except_test CachingCatalog.load: identifier: iceberg_except_test -> SparkTable -> SparkTable#copyWithSnapshotId SparkCatalog.loadTable(ident): identifier: iceberg_except_test SparkCatalog.load: identifier: iceberg_except_test CachingCatalog.load: identifier: iceberg_except_test -> SparkTable ``` Lastly, `refreshEagerly` is only valid when using Cache. If Cache is not used, `refreshEagerly` becomes false, and a refresh occurs. I don't think it's necessary to check whether cache is being used when verifying the equality of two SparkTables. This is because even in the case of simply performing a SELECT, the results may vary depending on the timing, which is an expected outcome. Furthermore, the current Iceberg code does not consider refreshEagerly when comparing two SparkTables, so it won't be worse than it currently is. Therefore, to summarize my argument: - It is sufficient to compare the name, branch, and snapshotId. - Remove the current implementation of extracting snapshotId from the branch in this PR. - When using CachingCatalog, there is no impact because a separate instance is created using the copyWithXX method. - We don't need to consider the impact of the refreshEagerly option I look forward to hearing your feedback. Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org