andygrove opened a new pull request, #4170:
URL: https://github.com/apache/datafusion-comet/pull/4170
## Which issue does this PR close?
Closes #.
## Rationale for this change
Comet's existing `RLike` implementation uses Rust's `regex` crate, which has
known semantic differences from Java's `java.util.regex` (Unicode handling,
possessive quantifiers, lookbehind, etc.). It is therefore reported as
`Incompatible` and only used when
`spark.comet.expr.allowIncompatible.regexp=true`.
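To make the semantic gap concrete, here is a small illustration (not from this PR) of two `java.util.regex` features that Rust's `regex` crate rejects at pattern-compile time:

```java
import java.util.regex.Pattern;

// Two java.util.regex features with no equivalent in Rust's regex crate,
// which refuses such patterns at compile time rather than mis-matching.
public class JavaRegexSemantics {
    // Possessive quantifier: a*+ never gives back characters once matched,
    // so "a*+b" matches "aaab" but "a*+ab" cannot match it (no backtracking).
    public static boolean possessive(String input) {
        return Pattern.matches("a*+b", input);
    }

    // Lookbehind: (?<=foo) asserts a position without consuming input.
    public static boolean lookbehind(String input) {
        return Pattern.compile("(?<=foo)bar").matcher(input).find();
    }
}
```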
This PR adds an alternative path: a generic mechanism for native expressions
to call back into the JVM via JNI to invoke a Scala-implemented `CometUDF`,
plus a first concrete user (`RegExpLikeUDF`) that runs
`java.util.regex.Pattern` for full Java-semantics compatibility. The trade-off
is one JNI roundtrip + Arrow C Data Interface marshalling per batch, in
exchange for byte-for-byte parity with Spark.
This is an internal compatibility mechanism, not a public user-facing UDF
API. The `CometUDF` trait, JNI bridge, and proto are scaffolding that future
PRs can reuse to fill other Spark-compatibility gaps (e.g. `RegExpReplace`,
`RegExpExtract`, `Like` with Java semantics).
Marked as a draft prototype: the scope is intentionally narrow (one expression,
one config flag, default off, original native path completely untouched). No
public API, no benchmarking, no productionisation. Posting for design feedback
before extending.
## What changes are included in this PR?
**JVM side** (`common/`):
- `CometUDF` trait: `def evaluate(inputs: Array[ValueVector]): ValueVector`
- `CometUdfBridge` Java class: static `evaluate(...)` JNI entry point that
imports inputs from FFI, dispatches to the cached UDF instance (per-thread),
exports the result back via FFI. Mirrors the existing `CometScalarSubquery`
static-method pattern.
- `RegExpLikeUDF`: concrete impl using `java.util.regex.Pattern` (cached per
pattern string).
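The `RegExpLikeUDF` idea can be sketched as follows, with plain arrays standing in for the Arrow `ValueVector` inputs and `BitVector` output (the class and method names here are illustrative, not the PR's actual code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

// Simplified sketch of the RegExpLikeUDF approach: Java-regex matching with
// a per-pattern-string Pattern cache, nulls propagated element-wise.
public class RegExpLikeSketch {
    private static final Map<String, Pattern> PATTERNS = new ConcurrentHashMap<>();

    public static Boolean[] evaluate(String[] inputs, String regex) {
        // Compile at most once per distinct pattern string.
        Pattern p = PATTERNS.computeIfAbsent(regex, Pattern::compile);
        Boolean[] out = new Boolean[inputs.length];
        for (int i = 0; i < inputs.length; i++) {
            // Null in -> null out; RLIKE uses "contains a match" (find),
            // not whole-string matching.
            out[i] = inputs[i] == null ? null : p.matcher(inputs[i]).find();
        }
        return out;
    }
}
```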
**Native side** (`native/`):
- New proto message `JvmScalarUdf { class_name, args, return_type }` in
`expr.proto`.
- `CometUdfBridge` JNI handle in `jni-bridge`, registered in
`JVMClasses::init`.
- `JvmScalarUdfExpr` `PhysicalExpr` in `spark-expr/jvm_udf/`: evaluates
child exprs, exports inputs via Arrow C Data Interface, calls
`CometUdfBridge.evaluate` via `JVMClasses::with_env`, imports the result back
via `from_ffi`.
- New planner arm in `core/src/execution/planner.rs` constructing the
PhysicalExpr from proto.
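The bridge's dispatch step (class name from the proto → per-thread cached UDF instance) can be sketched without the Arrow FFI plumbing; the `SimpleUdf` interface and names below are hypothetical stand-ins for the real `CometUDF`/`CometUdfBridge` types:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the CometUDF trait, minus Arrow types.
interface SimpleUdf {
    String[] evaluate(String[] inputs);
}

// Example UDF used to exercise the reflective lookup below.
class UpperUdf implements SimpleUdf {
    public String[] evaluate(String[] inputs) {
        String[] out = new String[inputs.length];
        for (int i = 0; i < inputs.length; i++) {
            out[i] = inputs[i] == null ? null : inputs[i].toUpperCase();
        }
        return out;
    }
}

public class UdfBridgeSketch {
    // Per-thread cache keyed by implementation class name, mirroring the
    // "cached UDF instance (per-thread)" dispatch described above.
    private static final ThreadLocal<Map<String, SimpleUdf>> CACHE =
            ThreadLocal.withInitial(HashMap::new);

    public static SimpleUdf lookup(String className) {
        return CACHE.get().computeIfAbsent(className, name -> {
            try {
                return (SimpleUdf) Class.forName(name)
                        .getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new RuntimeException("cannot instantiate UDF " + name, e);
            }
        });
    }
}
```

A per-thread cache avoids synchronization on the hot path and sidesteps any thread-safety questions about individual UDF implementations.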
**Wire-up:**
- New config `spark.comet.exec.regexp.useJVM` (default `false`).
- `CometRLike.convert` branches on the config: when on, it emits a `JvmScalarUdf`
proto pointing at `org.apache.comet.udf.RegExpLikeUDF` and reports
`Compatible`; when off, the original Rust path is used unchanged.
- Drive-by qualification in `CometArrayExpressionSuite.scala` to
disambiguate `org.apache.spark.sql.functions.udf` from the new
`org.apache.comet.udf` package (Scala name resolution shadowing).
## How are these changes tested?
- `RegExpLikeUDFSuite` (Scala unit, no JNI): null handling, pattern caching,
Java regex semantics on `BitVector` output.
- `CometRegExpJvmSuite` (end-to-end, flag enabled): correctness of `rlike`
against literal patterns, plan-shape assertion confirming `CometProjectExec` is
selected (UDF accepted into native execution), and `checkSparkAnswer` parity
verification.
- Existing `CometExpressionSuite` rlike tests still pass with the flag off
(no change to default behavior).
This was scaffolded with the `superpowers:brainstorming`,
`superpowers:writing-plans`, and `superpowers:subagent-driven-development`
project skills (spec and plan retained on disk, not committed).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]