neilconway opened a new pull request, #21352:
URL: https://github.com/apache/datafusion/pull/21352
## Which issue does this PR close?
- Closes #21199.
## Rationale for this change
This PR adds support for LEFT join semantics for lateral joins. This is a
bit tricky because of how it interacts with compensation for the "count bug".
This might be easiest to illustrate with an example; consider this query (Q1):
```sql
SELECT t1.id, sub.cnt
FROM t1 LEFT JOIN LATERAL (
SELECT count(*) AS cnt FROM t2 WHERE t2.t1_id = t1.id
) sub ON sub.cnt > 0
ORDER BY t1.id;
```
The initial decorrelation (Q2) is
```sql
SELECT t1.id, sub.cnt
FROM t1 LEFT JOIN (
SELECT count(*) AS cnt, t2.t1_id, TRUE AS __always_true
FROM t2 GROUP BY t2.t1_id
) sub ON t1.id = sub.t1_id
```
Ignoring the user's original `ON` clause for now. This initial query is
wrong, because `t1` rows that don't have a match in `t2` will get all-`NULL`
values, not `0` for `count(*)` of an empty set. This is the "count bug", and we
compensate for that by checking for rows when `__always_true` is `NULL`, and
replacing the agg value with the default for that agg (Q3):
```sql
SELECT t1.id,
CASE WHEN sub.__always_true IS NULL THEN 0
ELSE sub.cnt END AS cnt
FROM ( /* Q2 */ )
```
Now we just need to handle the user's original `ON` clause. We can't add
this to the rewritten `ON` clause in Q1, because we *don't* want the count-bug
compensation to fire. But we also can't just add it to the `WHERE` clause,
because we need left join semantics. So we can instead wrap another `CASE` that
re-checks the `ON` condition and substitutes `NULL` for every right-side column:
```sql
SELECT t1.id,
CASE WHEN (cnt > 0) IS NOT TRUE THEN NULL
ELSE cnt END AS cnt
FROM ( /* Q3 */ )
```
## What changes are included in this PR?
* Implement lateral left join rewrite as described above
* Update expected tests and add more test cases
* Update documentation
## Are these changes tested?
Yes.
## Are there any user-facing changes?
Support for a new feature.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]