deepyaman opened a new issue, #23220:
URL: https://github.com/apache/datafusion/issues/23220
### Describe the bug
When a non-deterministic/volatile function (e.g. `random()`, `uuid()`) is
computed once in a subquery and then referenced multiple times in the outer
projection, DataFusion >= 52.0.0 pushes the outer projection **into the
file-scan `DataSourceExec`** and **inlines the subquery alias**, turning the
single call into N independent calls.
Two references to what should be the same "locked-in" value then diverge.
This worked correctly in 51.0.0 and regressed in 52.0.0 (both 52.0.0 and 53.0.0
are affected).
It only reproduces with a **file scan** (Parquet/CSV); an in-memory
`MemTable` is not affected, which points at projection pushdown into the file
source.
### To Reproduce
datafusion-cli:
```sql
COPY (SELECT 1 AS id UNION ALL SELECT 2 UNION ALL SELECT 3) TO 't.parquet';
CREATE EXTERNAL TABLE t STORED AS PARQUET LOCATION 't.parquet';
EXPLAIN
SELECT s.r AS x, s.r AS y
FROM (SELECT random() AS r FROM t) AS s;
```
**51.0.0 — correct** (`random()` evaluated once, then reused):
```
ProjectionExec: expr=[r@0 as x, r@0 as y]
  ProjectionExec: expr=[random() as r]
    DataSourceExec: file_groups={...t.parquet}, file_type=parquet
```
**52.0.0 / 53.0.0 — incorrect** (`random()` inlined and duplicated):
```
DataSourceExec: file_groups={...t.parquet}, projection=[random() as x, random() as y], file_type=parquet
```
Executing the query confirms `x != y` on 53.0.0, whereas `x == y` on 51.0.0.
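The difference between the two plans can be mimicked in plain Python, without DataFusion. This is only a toy model of the semantics: `plan_51` and `plan_52` are hypothetical names chosen for this sketch, and lists of tuples stand in for record batches.

```python
import random

rows = [1, 2, 3]  # stands in for the three `id` values in t.parquet


def plan_51(rows, rng):
    # Inner projection: call random() once per row and materialize it as r.
    inner = [{"r": rng.random()} for _ in rows]
    # Outer projection: both output columns read the same materialized r.
    return [(row["r"], row["r"]) for row in inner]


def plan_52(rows, rng):
    # Merged projection: the alias r has been replaced by its definition,
    # so random() is called once per output column instead of once per row.
    return [(rng.random(), rng.random()) for _ in rows]


rng = random.Random(0)  # fixed seed so the run is repeatable
reused = plan_51(rows, rng)
inlined = plan_52(rows, rng)

print(all(x == y for x, y in reused))   # do x and y agree when r is reused?
print(all(x == y for x, y in inlined))  # do they agree after inlining?
```

The first shape keeps the subquery's "compute once, reference many times" contract; the second silently changes the number of calls per row.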
### Expected behavior
A volatile/non-deterministic expression aliased in a subquery should be
evaluated **once** and reused by later references, as in 51.0.0. The optimizer
should not inline/duplicate a volatile expression when pushing a projection
into a scan (cf. #10337 for the CTE analogue).
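One possible shape for such a guard, as a minimal sketch over a toy expression tree. Everything here is hypothetical (`Column`, `Call`, `VOLATILE_FUNCTIONS`, `can_merge_projections`); it is not DataFusion's API or its actual optimizer code, and it assumes the rule "do not inline a volatile expression that is referenced more than once".

```python
from dataclasses import dataclass

# Assumption for this sketch: a fixed set of volatile function names.
VOLATILE_FUNCTIONS = {"random", "uuid"}


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Call:
    func: str
    args: tuple = ()


def is_volatile(expr):
    # An expression is volatile if it, or any argument, calls a volatile function.
    if isinstance(expr, Call):
        return expr.func in VOLATILE_FUNCTIONS or any(is_volatile(a) for a in expr.args)
    return False


def count_refs(expr, name):
    # Count how many times column `name` appears inside `expr`.
    if isinstance(expr, Column):
        return int(expr.name == name)
    if isinstance(expr, Call):
        return sum(count_refs(a, name) for a in expr.args)
    return 0


def can_merge_projections(inner, outer):
    # inner: dict mapping alias -> defining expression (the subquery projection)
    # outer: list of expressions that reference those aliases
    for alias, expr in inner.items():
        refs = sum(count_refs(e, alias) for e in outer)
        if is_volatile(expr) and refs > 1:
            return False  # inlining would duplicate a volatile call
    return True


# The query from the report: SELECT s.r AS x, s.r AS y FROM (SELECT random() AS r ...)
inner = {"r": Call("random")}
outer = [Column("r"), Column("r")]
print(can_merge_projections(inner, outer))
```

Under this rule a single reference to `r` could still be inlined, since the number of `random()` calls per row would not change; a more conservative optimizer might refuse to inline volatile expressions at all.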
### Additional context
- Regression introduced in **52.0.0** (51.0.0 correct; 52.0.0 and 53.0.0
affected).
- Reproduces with Parquet and CSV file scans; not with in-memory tables.
- Surfaced downstream in [Ibis](https://github.com/ibis-project/ibis), which
relies on the subquery-aliasing pattern to "lock in" `random()`/`uuid()` values
(`ibis/backends/tests/test_impure.py::test_impure_correlated` and
`::test_chained_selections`). Equivalent Ibis reproducer:
```python
import ibis
from ibis import _
con = ibis.datafusion.connect()
# File-backed table: the bug needs a file scan.
ibis.memtable({"id": [1, 2, 3]}).to_parquet("t.parquet")
t = con.read_parquet("t.parquet")
expr = t.select(common=ibis.random()).select(x=_.common, y=_.common)
df = expr.execute()
print((df.x == df.y).all()) # True on 51.0.0, False on >= 52.0.0
```
---
*Generated-by: Claude Opus 4.8 <[email protected]>*