alamb opened a new issue, #23095:
URL: https://github.com/apache/datafusion/issues/23095

   ### Describe the bug
   
   The `unwrap_cast` optimization (in both the logical `ExprSimplifier` and the 
physical simplifier) still produces **incorrect query results** for narrowing 
casts even after #22837. #22837 fixed the specific case of timestamp precision 
narrowing, but the same class of bug remains for other many-to-one casts, most 
notably **decimal scale narrowing**.
   
   This is a follow-up to #22142 (which #22837 partially addresses) to track 
making the optimization conservative for the *general* problem rather than 
blocklisting one type family at a time.
   
   ### Reproducer (decimal, still broken after #22837)
   
   ```sql
   CREATE TABLE t(d DECIMAL(20,4)) AS VALUES (1.2345), (1.2399), (1.2500);
   
   -- 1.2345 truncates to 1.23 at scale 2, so this MUST return 1 row:
   SELECT * FROM t
   WHERE arrow_cast(d, 'Decimal128(20, 2)') = arrow_cast(1.23, 'Decimal128(20, 
2)');
   
   -- predicate is rewritten to `d = 1.2300` (exact) -> returns 0 rows. WRONG.
   +---+
   | d |
   +---+
   +---+
   0 row(s) fetched.
   ```
   
   The original predicate `CAST(d AS Decimal128(20,2)) = 1.23` means "any `d` 
that truncates to 1.23 at scale 2" — a *range* of source values. The rewrite 
collapses it to a single exact value `d = 1.2300`, dropping `1.2345` and 
`1.2399`.
   
   ### Background / summary
   
   The `unwrap_cast` optimization rewrites predicates of the form:
   
   ```
   CAST(col AS target_type) <op> literal_target
     -->  col <op> try_cast_literal_to_type(literal_target, col_type)
   ```
   
   This is only valid when the cast applied **to the column** is one-to-one 
(value-preserving / a widening). When the column-side cast is **many-to-one** 
(narrowing: timestamp precision, decimal scale, float→int, date↔timestamp, 
etc.), multiple distinct source values map to the same target value, so an 
exact-equality rewrite on the source column is semantically wrong — the correct 
source-domain preimage is a *range*, not a singleton.
   
   This matches Spark's `UnwrapCastInBinaryComparison`, which has a two-layer 
check (see #22142):
   - **Layer 1** — a check on the *data types* (is this cast direction safe to 
unwrap at all?)
   - **Layer 2** — a check that the *literal* can be cast losslessly
   
   DataFusion currently only has the equivalent of Layer 2 
(`try_cast_literal_to_type`) and is missing the general Layer 1 type-direction 
check. #22837 added a narrow Layer-1-style guard for timestamp precision 
narrowing only; this ticket tracks the general fix.
   
   ### Why this doesn't blow up more often
   
   The optimization mostly stays correct today **by accident**, because it 
leans on `try_cast_literal_to_type` (`datafusion/expr-common/src/casts.rs`) 
being conservative. That helper is a restricted, *value-preserving* cast of the 
**literal** that returns `None` when the value can't be represented exactly in 
the source type (out of range, decimal digits lost, etc.). For many lossy casts 
the literal simply fails to round-trip, `try_cast_literal_to_type` returns 
`None`, and the unwrap is silently skipped — so no wrong answer.
   
   The bug surfaces precisely in the cases where:
   1. the **literal** *does* round-trip exactly into the source type (so Layer 
2 passes), **but**
   2. the **column-side cast is many-to-one**, so exact equality on the source 
column is still wrong.
   
   Decimal scale narrowing is exactly this trap: casting the literal `1.23` 
(scale 2) *up* to scale 4 → `1.2300` is exact and value-preserving, so Layer 2 
happily passes — but `CAST(d AS Decimal(20,2))` on the column is a truncating, 
many-to-one cast, so the rewrite is wrong. Timestamp precision narrowing (now 
fixed) had the same shape.
   
   So the right fix is a Layer-1 type-direction check (or a preimage-based 
rewrite) rather than relying on `try_cast_literal_to_type` alone.
   
   ### Proposed approaches (PRs from @discord9)
   
   - **#22837** (merged) — small blocklist fix for timestamp precision 
narrowing only.
   - **#21908** — replace default-allow with a *closed-by-default allowlist* of 
cast families proven safe (timestamp widening, same-sign integer widening, 
decimal widening, operator-gated int→string, etc.), still requiring exact 
literal round-trip.
   - **#22906** — a more general *cast-predicate-preimage* approach (analogous 
to the existing `udf_preimage`): represents the source-domain preimage as 
`Exact(ScalarValue)` or `Range(Interval)`, so many-to-one casts 
(timestamp/decimal narrowing) can be rewritten correctly into a half-open range 
(`col >= lo AND col < hi`) instead of being either blocked or wrongly collapsed 
to a singleton. Makes exact rewrites closed-by-default.
   
   ### Acceptance criteria
   
   - The decimal reproducer above returns the correct rows.
   - The optimization is conservative-by-default for *all* type families (not 
just timestamps), either via allowlist (#21908) or preimage (#22906).
   - Coverage for the other affected families called out in #22142 (int↔uint, 
float→int, date↔timestamp, dictionary, string round-trip).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to