szehon-ho commented on code in PR #13167: URL: https://github.com/apache/iceberg/pull/13167#discussion_r2153381495
########## spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/sql/TestStoragePartitionedJoins.java: ########## @@ -549,6 +555,146 @@ public void testJoinsWithMismatchingPartitionKeys() { tableName(OTHER_TABLE_NAME)); } + @TestTemplate + public void testJoinsCompatibleBucketNumbers() { + sql( + "CREATE TABLE %s (id BIGINT, int_col INT, dep STRING)" + + "USING iceberg " + + "PARTITIONED BY (bucket(4, id))" + + "TBLPROPERTIES (%s)", + tableName, tablePropsAsString(TABLE_PROPERTIES)); + + sql( + "INSERT INTO %s VALUES " + + "(1L, 100, 'software')," + + "(2L, 101, 'hr')," + + "(3L, 102, 'operation')," + + "(4L, 103, 'sales')," + + "(5L, 104, 'marketing')," + + "(6L, 105, 'pr')", + tableName); + + sql( + "CREATE TABLE %s (id BIGINT, int_col INT, dep STRING)" + + "USING iceberg " + + "PARTITIONED BY (bucket(6, id))" + + "TBLPROPERTIES (%s)", + tableName(OTHER_TABLE_NAME), tablePropsAsString(TABLE_PROPERTIES)); + + sql( + "INSERT INTO %s VALUES " + + "(1L, 100, 'software')," + + "(3L, 300, 'hardware')," + + "(4L, 103, 'sales')," + + "(5L, 104, 'marketing')," + + "(6L, 105, 'pr')", + tableName(OTHER_TABLE_NAME)); + + assertPartitioningAwarePlan( + 1, /* expected num of shuffles with SPJ */ + 3, /* expected num of shuffles without SPJ */ + "SELECT * " + + "FROM %s t1 " + + "INNER JOIN %s t2 " + + "ON t1.id = t2.id " + + "ORDER BY t1.id, t1.int_col, t1.dep, t2.id, t2.int_col, t2.dep", + tableName, + tableName(OTHER_TABLE_NAME)); + } + + @TestTemplate + public void testJoinsWithEqualBucketNumbers() { Review Comment: Actually I look again at the test and @huaxingao comment > Thanks @himadripal for the PR — it looks good overall. I've left a few minor comments. Could you please update the PR description to briefly explain the purpose of this change? Additionally, maybe also consider adding a test case to cover the scenario where reducer(...) returns null when thisNumBuckets == GCD — this would help ensure the no-op reducer path is exercised as expected. This test is not so useful right? There's already tests here for this case, shouldn't we have a test for a 1 bucket vs 3 bucket case? (gcd of 1 an 3 is 1, but we should not take 1 and do SPJ, iirc?) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org