andygrove opened a new pull request, #4170:
URL: https://github.com/apache/datafusion-comet/pull/4170

   ## Which issue does this PR close?
   
   Closes #.
   
   ## Rationale for this change
   
   Comet's existing `RLike` implementation uses Rust's `regex` crate, which has 
known semantic differences from Java's `java.util.regex` (Unicode handling, 
possessive quantifiers, lookbehind, etc.). It is therefore reported as 
`Incompatible` and is only used when 
`spark.comet.expr.allowIncompatible.regexp=true`.
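
   To make the semantic gap concrete, here is a small illustration (plain 
`java.util.regex` only, no Comet code) of two constructs Java accepts but 
Rust's `regex` crate rejects at pattern-compile time:

   ```java
   import java.util.regex.Pattern;

   public class JavaRegexGap {
       public static void main(String[] args) {
           // Possessive quantifier: valid in java.util.regex; the Rust
           // `regex` crate returns a parse error for "a*+b".
           System.out.println(Pattern.matches("a*+b", "aaab"));   // true

           // Lookbehind: also unsupported by the `regex` crate.
           System.out.println(Pattern.compile("(?<=foo)bar")
                   .matcher("foobar").find());                    // true
       }
   }
   ```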
   
   This PR adds an alternative path: a generic mechanism for native expressions 
to call back into the JVM via JNI to invoke a Scala-implemented `CometUDF`, 
plus a first concrete user (`RegExpLikeUDF`) that runs 
`java.util.regex.Pattern` for full Java-semantics compatibility. The trade-off 
is one JNI roundtrip + Arrow C Data Interface marshalling per batch, in 
exchange for byte-for-byte parity with Spark.
   
   This is an internal compatibility mechanism, not a public user-facing UDF 
API. The `CometUDF` trait, JNI bridge, and proto are scaffolding that future 
PRs can reuse to fill other Spark-compatibility gaps (e.g. `RegExpReplace`, 
`RegExpExtract`, `Like` with Java semantics).
   
   Marked as a draft prototype: the scope is intentionally narrow (one 
expression, one config flag, default off, the original native path completely 
untouched). No public API, no benchmarking, no productionisation. Posting for 
design feedback before extending.
   
   ## What changes are included in this PR?
   
   **JVM side** (`common/`):
   - `CometUDF` trait: `def evaluate(inputs: Array[ValueVector]): ValueVector`
   - `CometUdfBridge` Java class: a static `evaluate(...)` JNI entry point that 
imports inputs from FFI, dispatches to a per-thread cached UDF instance, and 
exports the result back via FFI. Mirrors the existing `CometScalarSubquery` 
static-method pattern.
   - `RegExpLikeUDF`: concrete impl using `java.util.regex.Pattern` (cached per 
pattern string).
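
   As an illustration of the caching and null handling described above, a 
simplified sketch (hypothetical: the real `RegExpLikeUDF` operates on Arrow 
`ValueVector`s rather than plain arrays, and the class and method names here 
are stand-ins, not the PR's code):

   ```java
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;
   import java.util.regex.Pattern;

   // Simplified stand-in: plain arrays replace Arrow VarCharVector/BitVector,
   // but the per-pattern-string cache and null propagation mirror the PR.
   public class RegExpLikeSketch {
       private static final Map<String, Pattern> PATTERN_CACHE =
               new ConcurrentHashMap<>();

       public static Boolean[] evaluate(String[] inputs, String patternStr) {
           // Compile once per distinct pattern string, then reuse.
           Pattern pattern =
                   PATTERN_CACHE.computeIfAbsent(patternStr, Pattern::compile);
           Boolean[] out = new Boolean[inputs.length];
           for (int i = 0; i < inputs.length; i++) {
               // Spark's rlike is a "contains" match (Matcher.find), and a
               // null input yields a null output.
               out[i] = inputs[i] == null
                       ? null
                       : pattern.matcher(inputs[i]).find();
           }
           return out;
       }
   }
   ```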
   
   **Native side** (`native/`):
   - New proto message `JvmScalarUdf { class_name, args, return_type }` in 
`expr.proto`.
   - `CometUdfBridge` JNI handle in `jni-bridge`, registered in 
`JVMClasses::init`.
   - `JvmScalarUdfExpr` `PhysicalExpr` in `spark-expr/jvm_udf/`: evaluates 
child exprs, exports inputs via Arrow C Data Interface, calls 
`CometUdfBridge.evaluate` via `JVMClasses::with_env`, imports the result back 
via `from_ffi`.
   - New planner arm in `core/src/execution/planner.rs` constructing the 
PhysicalExpr from proto.
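
   For orientation, the new message might look roughly like this (a hedged 
sketch: field numbers, exact field types, and the referenced message names are 
assumptions, not copied from `expr.proto`):

   ```protobuf
   // Hypothetical sketch only; see expr.proto in the PR for the real definition.
   message JvmScalarUdf {
     string class_name = 1;     // fully-qualified JVM class implementing CometUDF
     repeated Expr args = 2;    // child expressions, evaluated natively first
     DataType return_type = 3;  // declared result type of the UDF
   }
   ```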
   
   **Wire-up:**
   - New config `spark.comet.exec.regexp.useJVM` (default `false`).
   - `CometRLike.convert` branches on the config: when enabled, it emits a 
`JvmScalarUdf` proto pointing at `org.apache.comet.udf.RegExpLikeUDF` and 
reports `Compatible`; when disabled, the original Rust path is used.
   - Drive-by qualification in `CometArrayExpressionSuite.scala` to 
disambiguate `org.apache.spark.sql.functions.udf` from the new 
`org.apache.comet.udf` package (Scala name resolution shadowing).
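
   To try the new path end to end, it should be enough to flip the flag (a 
config fragment; only the flag name comes from this PR, the rest is 
illustrative):

   ```shell
   # Illustrative spark-submit fragment; the usual Comet jars/extensions
   # still need to be configured as documented elsewhere.
   spark-submit \
     --conf spark.comet.exec.regexp.useJVM=true \
     ...
   ```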
   
   ## How are these changes tested?
   
   - `RegExpLikeUDFSuite` (Scala unit, no JNI): null handling, pattern caching, 
Java regex semantics on `BitVector` output.
   - `CometRegExpJvmSuite` (end-to-end, flag enabled): correctness of `rlike` 
against literal patterns, plan-shape assertion confirming `CometProjectExec` is 
selected (UDF accepted into native execution), and `checkSparkAnswer` parity 
verification.
   - Existing `CometExpressionSuite` rlike tests still pass with the flag off 
(no change to default behavior).
   
   This was scaffolded with the `superpowers:brainstorming`, 
`superpowers:writing-plans`, and `superpowers:subagent-driven-development` 
project skills (spec and plan retained on disk, not committed).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
