andygrove opened a new issue, #4155:
URL: https://github.com/apache/datafusion-comet/issues/4155

   ### Describe the bug
   
   On Spark 4.0, `try_url_decode(x)` should return `NULL` when `x` is not a 
valid percent-encoded string. Comet errors instead.
   
   ### Root cause
   
   Spark 4.0 rewrites `try_url_decode(x)` through Catalyst's 
`RuntimeReplaceable` chain:
   
   ```
   TryUrlDecode(x)
     -> UrlDecode(x, failOnError = false)
       -> StaticInvoke(UrlCodec, "decode", Seq(x, Literal(false)), 
Seq(StringTypeWithCollation, BooleanType))
   ```
   
   `CometUrlDecodeStaticInvoke` (in 
`spark/src/main/scala/org/apache/comet/serde/statics.scala`) only reads 
`expr.children.head` and emits `scalarFunctionExprToProto("url_decode", 
child)`. The `Literal(false)` carrying the `failOnError` flag is silently 
dropped. The native `SparkUrlDecode` (datafusion-spark) always errors on 
malformed percent-encoding, so the Comet path errors where Spark would have 
returned NULL.
   
   This affects only Spark 4.0+ (`try_url_decode` does not exist in 3.4 / 3.5).
   
   ### Reproduction (Spark 4.0)
   
   ```sql
   -- Spark returns NULL
   SELECT try_url_decode('http%3A%2F%2spark.apache.org');
   
   -- With Comet enabled, this throws "Invalid percent-encoding: ..." instead 
of returning NULL.
   ```
   
   ### Suggested fix
   
   Either of:
   
   1. Detect the Spark-4.0 `StaticInvoke(UrlCodec, "decode", Seq(child, 
Literal(false)), ...)` shape in `CometUrlDecodeStaticInvoke` and emit a 
`try_url_decode` scalar function. 
`datafusion_spark::function::url::try_url_decode::TryUrlDecode` already exists 
in the `datafusion-spark` crate but is not registered in 
`native/core/src/execution/jni_api.rs`. Registering it (similar to how 
`SparkTryParseUrl` is already registered) would make the dispatch trivial.
   2. Mark this StaticInvoke shape as `Unsupported` / `Incompatible` so Comet 
falls back to Spark whenever `failOnError=false`.
   
   ### Notes
   
   - For `url_decode(x)` (i.e. `failOnError=true`, the default), Comet's 
behavior matches Spark: both error on malformed input. The error class differs 
(`CANNOT_DECODE_URL` vs DataFusion's `Invalid percent-encoding`) but both fail.
   - This was discovered while auditing the `url_decode` expression added in 
#4152. A test for `url_decode` malformed input was added in that PR. A 
`try_url_decode` test should be added once this issue is fixed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to