Shekharrajak opened a new issue, #3881:
URL: https://github.com/apache/datafusion-comet/issues/3881
### What is the problem the feature request solves?
Comet does not support ExistenceJoin, causing incorrect results for
correlated IN subqueries combined with OR on Spark 4.0. Adding native
ExistenceJoin support would allow Comet to handle these plans end-to-end and
eliminate the mixed Spark/Comet execution that produces wrong results.
The following query returns incorrect results when Comet is enabled on Spark
4.0:
```sql
CREATE TEMPORARY VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2);
CREATE TEMPORARY VIEW t2(c1, c2) AS VALUES (0, 2), (0, 3);
SELECT * FROM t1 WHERE
  c1 IN (SELECT count(*) + 1 FROM t2 WHERE t2.c1 = t1.c1) OR
  c2 IN (SELECT count(*) - 1 FROM t2 WHERE t2.c1 = t1.c1);
```
Expected: (0, 1) and (1, 2) (both rows match via different OR branches)
Actual with Comet: (1, 2) only (row (0, 1) is dropped)
The same query with a single IN (no OR) produces correct results.
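To make the expected result easy to check by hand, here is a small plain-Python model of the query's semantics. It does not use Spark or Comet; the helper names `correlated_count` and `matches` are made up for illustration, and the only assumption is the standard SQL rule that `count(*)` without GROUP BY returns 0 when no rows match.

```python
# Plain-Python model of the query above (no Spark or Comet involved).
t1 = [(0, 1), (1, 2)]
t2 = [(0, 2), (0, 3)]

def correlated_count(outer_c1):
    # count(*) over t2 rows where t2.c1 = t1.c1.
    # An aggregate without GROUP BY yields 0, not an empty result,
    # when no t2 row matches.
    return sum(1 for (c1, _c2) in t2 if c1 == outer_c1)

def matches(row):
    c1, c2 = row
    n = correlated_count(c1)
    first_branch = c1 in [n + 1]   # c1 IN (SELECT count(*) + 1 ...)
    second_branch = c2 in [n - 1]  # c2 IN (SELECT count(*) - 1 ...)
    return first_branch or second_branch

result = [row for row in t1 if matches(row)]
print(result)  # [(0, 1), (1, 2)]
```

Row (0, 1) sees a count of 2, so it fails the first branch (0 is not 3) and passes the second (1 equals 1). Row (1, 2) sees a count of 0, so it passes the first branch (1 equals 1).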
### Describe the potential solution
_No response_
### Additional context
This only reproduces on Spark 4.0 because:
- The `in-count-bug.sql` test only exists in Spark 4.0.
- Spark 4.0 uses a new decorrelation path (`decorrelateInnerQueryEnabledForExistsIn = true` by default) that produces DomainJoin + ExistenceJoin plan structures.
- Spark 3.5 uses the old decorrelation, which has the "count bug" (different expected behavior).
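
For readers unfamiliar with ExistenceJoin, the sketch below models its semantics in plain Python: every left row is kept and a boolean "exists" flag is appended, which is what lets an OR over two IN predicates become a filter over two flags (a semi join would drop rows too early). This is a simplified illustration, not Spark's actual physical plan and not Comet code; `existence_join`, `sub_plus`, and `sub_minus` are hypothetical names, and the per-key subquery rows stand in for what the decorrelated plan would compute.

```python
t1 = [(0, 1), (1, 2)]
t2 = [(0, 2), (0, 3)]

def existence_join(left, right, condition):
    """Keep every left row and append a flag: does any right row match?"""
    return [row + (any(condition(row, r) for r in right),) for row in left]

# Simplified stand-in for the decorrelated subqueries: one (outer_c1, value)
# row per distinct outer key, with count(*) = 0 when t2 has no matching rows.
domain = sorted({c1 for (c1, _c2) in t1})
counts = {d: sum(1 for (c1, _c2) in t2 if c1 == d) for d in domain}
sub_plus = [(d, n + 1) for d, n in counts.items()]
sub_minus = [(d, n - 1) for d, n in counts.items()]

# First IN: t1.c1 IN (count(*) + 1), correlated on c1.
step1 = existence_join(t1, sub_plus,
                       lambda l, r: l[0] == r[0] and l[0] == r[1])
# Second IN: t1.c2 IN (count(*) - 1), correlated on c1.
step2 = existence_join(step1, sub_minus,
                       lambda l, r: l[0] == r[0] and l[1] == r[1])

# The OR becomes a filter over the two flags, then the flags are dropped.
filtered = [(c1, c2) for (c1, c2, e1, e2) in step2 if e1 or e2]
print(filtered)  # [(0, 1), (1, 2)]
```

In this model each row is kept by exactly one flag, so both existence flags have to be evaluated correctly for every left row for the OR to return both rows.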
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]