andygrove opened a new issue, #4051:
URL: https://github.com/apache/datafusion-comet/issues/4051

   ## Describe the bug
   
   Follow-up from 
[#4035](https://github.com/apache/datafusion-comet/pull/4035), which rejects 
non-default collated strings in shuffle, sort, and aggregate but does not cover 
the join paths.
   
   `CometHashJoin.doConvert` (used by both `CometHashJoinExec` and 
`CometBroadcastHashJoinExec`) and 
`CometSortMergeJoinExec.supportedSortMergeJoinEqualType` both accept any 
`StringType` as a join key without checking for non-default collation. Comet's 
native join performs byte-level key equality, so rows that compare equal under 
a collation (e.g., `'a'` vs `'A'` under `utf8_lcase`) will not match.
   
   For sort-merge join this is masked today by the shuffle-side fallback added 
in #4035: the shuffle on a collated key falls back to Spark, which forces the 
whole stage to Spark. Broadcast hash join has no shuffle to protect it, so 
incorrect results can escape.
   
   ## Steps to reproduce
   
   With Spark 4.0.1:
   
   ```sql
   SET spark.comet.exec.broadcastHashJoin.enabled = true;
   
   CREATE OR REPLACE TEMP VIEW l AS SELECT * FROM (VALUES ('a'), ('B')) AS t(c);
   CREATE OR REPLACE TEMP VIEW r AS SELECT * FROM (VALUES ('A'), ('b')) AS t(c);
   
   SELECT l.c, r.c
   FROM l JOIN r
   ON l.c COLLATE utf8_lcase = r.c COLLATE utf8_lcase;
   ```
   
   Spark returns two matching rows; Comet (with broadcast hash join enabled) 
returns zero because the byte-level equality in the native join does not match 
`'a'` with `'A'`.
   
   ## Expected behavior
   
   Comet should fall back to Spark for hash joins and sort-merge joins whose 
equality keys have a non-default collation, either by rejecting the key types 
in `supportedSortMergeJoinEqualType` / a new equivalent check in 
`CometHashJoin.doConvert`, or by funneling through the existing 
`isStringCollationType` predicate on `CometTypeShim`.
   
   ## Additional context
   
   - Related PR: #4035
   - Related issue: #1947
   - Relevant code:
     - `spark/src/main/scala/org/apache/spark/sql/comet/operators.scala` 
`CometHashJoin.doConvert` (around line 1697)
     - `spark/src/main/scala/org/apache/spark/sql/comet/operators.scala` 
`supportedSortMergeJoinEqualType` (around line 2138)
     - `common/src/main/spark-4.0/org/apache/comet/shims/CometTypeShim.scala` 
`isStringCollationType`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to