andygrove opened a new pull request, #4594:
URL: https://github.com/apache/datafusion-comet/pull/4594

   ## Which issue does this PR close?
   
   Closes #4464
   Closes #4463
   
   ## Rationale for this change
   
   Two correctness issues surfaced by the string-expressions audit (#4461). In 
both cases, the serde reports `Compatible` while the underlying native path 
silently diverges from Spark — EXPLAIN, the auto-generated compatibility doc, 
and the dispatcher all see `Compatible` so the operator runs natively rather 
than falling back.
   
   * **#4464**: `bit_length` / `octet_length` report `Compatible(None)` for 
`BinaryType`, but DataFusion's `BitLengthFunc` / `OctetLengthFunc` use 
`Signature::coercible(... logical_string() ...)` and reject `Binary` at 
execution time. The result: `bit_length(<binary>)` and `octet_length(<binary>)` 
plan successfully under Comet, then surface as a native execution error rather 
than falling back cleanly. The sibling `length` already guards `BinaryType` via 
`CometLength`.
   
   * **#4463**: `translate` is wired as `CometScalarFunction("translate")` and 
reports `Compatible`, but DataFusion's `translate` (1) iterates over Unicode 
graphemes while Spark uses code points (so combining marks / ZWJ sequences 
disagree), and (2) substitutes U+0000 instead of treating it as a deletion 
sentinel like Spark's `StringTranslate.buildDict`.
   
   ## What changes are included in this PR?
   
   * `CometBitLength` / `CometOctetLength`: new serdes that gate `BinaryType` 
as `Unsupported(Some(...))` (mirroring the existing `CometLength` shape). The 
string path remains `Compatible`.
   * `CometStringTranslate`: new serde that returns `Incompatible(Some(...))` 
so the divergent native path only runs when the user opts in via 
`spark.comet.expression.StringTranslate.allowIncompatible=true`. The notes call 
out both divergences (graphemes vs code points, U+0000 deletion).
   
   ## How are these changes tested?
   
   * `bit_length.sql` / `octet_length.sql`: extended with `expect_fallback(... 
on BinaryType is not supported)` cases that confirm binary input falls back 
cleanly and Spark and Comet agree on the answer.
   * `string_translate.sql`: converted to `expect_fallback(is not fully 
compatible with Spark)` to assert the default-path fallback behaviour.
   * `string_translate_enabled.sql`: new fixture that sets 
`spark.comet.expression.StringTranslate.allowIncompatible=true` and exercises 
the native path on ASCII inputs where Spark and DataFusion agree.
   * `CometStringExpressionSuite` "length, reverse, instr, replace, translate" 
wrapped in `withSQLConf(...allowIncompatible=true)` so the existing translate 
assertion still runs natively.
   
   Verified passing on Spark 3.5: `CometSqlFileTestSuite expressions/string/` 
(45 tests succeeded), `CometStringExpressionSuite` (33 tests succeeded), 
`spotless:check`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to