andygrove opened a new issue, #4051: URL: https://github.com/apache/datafusion-comet/issues/4051
## Describe the bug

Follow-up from [#4035](https://github.com/apache/datafusion-comet/pull/4035), which rejects non-default collated strings in shuffle, sort, and aggregate but does not cover the join paths.

`CometHashJoin.doConvert` (used by both `CometHashJoinExec` and `CometBroadcastHashJoinExec`) and `CometSortMergeJoinExec.supportedSortMergeJoinEqualType` both accept any `StringType` as a join key without checking for non-default collation. Comet's native join performs byte-level key equality, so rows that compare equal under a collation (e.g., `'a'` vs `'A'` under `utf8_lcase`) will not match.

For sort-merge join this is masked today by the shuffle-side fallback added in #4035: the shuffle on a collated key falls back to Spark, which forces the whole stage to Spark. Broadcast hash join has no shuffle to protect it, so incorrect results can escape.

## Steps to reproduce

With Spark 4.0.1:

```sql
SET spark.comet.exec.broadcastHashJoin.enabled = true;

CREATE OR REPLACE TEMP VIEW l AS SELECT * FROM (VALUES ('a'), ('B')) AS t(c);
CREATE OR REPLACE TEMP VIEW r AS SELECT * FROM (VALUES ('A'), ('b')) AS t(c);

SELECT l.c, r.c
FROM l JOIN r
  ON l.c COLLATE utf8_lcase = r.c COLLATE utf8_lcase;
```

Spark returns two matching rows; Comet (with broadcast hash join enabled) returns zero because the byte-level equality in the native join does not match `'a'` with `'A'`.

## Expected behavior

Comet should fall back to Spark for hash joins and sort-merge joins whose equality keys have a non-default collation, either by rejecting the key types in `supportedSortMergeJoinEqualType` / a new equivalent check in `CometHashJoin.doConvert`, or by funneling through the existing `isStringCollationType` predicate on `CometTypeShim`.
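To illustrate why byte-level key equality loses the matches in the repro, here is a minimal Python model of a hash join. It is not Comet or Spark code: `hash_join` and `lcase_key` are hypothetical names, and `str.lower()` is only an assumed stand-in for `utf8_lcase`, not Spark's actual collation algorithm.

```python
from collections import defaultdict


def hash_join(left, right, key=lambda s: s.encode("utf-8")):
    """Inner equi-join of two lists of strings.

    `key` maps each value to the bytes used for hashing and equality.
    The default compares raw UTF-8 bytes, as a byte-level join would.
    """
    build = defaultdict(list)
    for r in right:
        build[key(r)].append(r)  # build side
    # Probe side: emit one pair per matching build-side row.
    return [(l, r) for l in left for r in build.get(key(l), [])]


def lcase_key(s):
    # Assumed approximation of utf8_lcase: compare on the lowercased form.
    return s.lower().encode("utf-8")


left = ["a", "B"]   # rows of view l
right = ["A", "b"]  # rows of view r

byte_level = hash_join(left, right)                      # no rows match
collation_aware = hash_join(left, right, key=lcase_key)  # both rows match

print(byte_level)
print(collation_aware)
```

In this model the byte-level join produces no pairs, while the lowercase-keyed join pairs `'a'` with `'A'` and `'B'` with `'b'`, which mirrors the zero-row versus two-row difference described above.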
## Additional context

- Related PR: #4035
- Related issue: #1947
- Relevant code:
  - `spark/src/main/scala/org/apache/spark/sql/comet/operators.scala` `CometHashJoin.doConvert` (around line 1697)
  - `spark/src/main/scala/org/apache/spark/sql/comet/operators.scala` `supportedSortMergeJoinEqualType` (around line 2138)
  - `common/src/main/spark-4.0/org/apache/comet/shims/CometTypeShim.scala` `isStringCollationType`
