andygrove opened a new pull request, #4146:
URL: https://github.com/apache/datafusion-comet/pull/4146
## Which issue does this PR close?
N/A
## Rationale for this change
`regexp_extract` is a common Spark SQL string function used to pull
substrings out of input strings via regex capture groups. Adding native support
lets queries that use it stay in Comet instead of falling back to Spark.
## What changes are included in this PR?
- New Rust UDF `spark_regexp_extract` in
`native/spark-expr/src/string_funcs/regexp_extract.rs`, backed by the `regex`
crate. It handles `Utf8` and `LargeUtf8` inputs (array and scalar). `idx`
defaults to 1, and `idx=0` returns the whole match. No match and an unmatched
optional group both return the empty string, null input returns null, and an
out-of-range `idx` returns an execution error.
- Registration of `regexp_extract` in `comet_scalar_funcs.rs`.
- New `CometRegExpExtract` serde mapping `RegExpExtract` to the native UDF.
It is reported as `Incompatible` because the Rust regex engine has different
semantics from Java's regex engine (POSIX classes, look-around, possessive
quantifiers, etc.). Users opt in via
`spark.comet.expression.RegExpExtract.allowIncompatible=true`. The expression
falls back to Spark when the pattern or `idx` is non-literal.
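
For reviewers who want the edge cases in one place, here is a minimal Python sketch of the semantics listed in the first bullet. It is a model only, not the Rust UDF: `regexp_extract_model` is a hypothetical name, it uses Python's `re` module rather than the `regex` crate, and it assumes that a null pattern returns null and that `idx` is validated before matching (the description above does not state either).

```python
import re
from typing import Optional


def regexp_extract_model(
    subject: Optional[str], pattern: Optional[str], idx: int = 1
) -> Optional[str]:
    """Hypothetical model of the behavior described in this PR."""
    # Null input returns null (null pattern handling is an assumption).
    if subject is None or pattern is None:
        return None

    # An invalid regex raises re.error here.
    compiled = re.compile(pattern)

    # Negative or out-of-range idx is an error
    # (checking before the match is an assumption).
    if idx < 0 or idx > compiled.groups:
        raise ValueError(
            f"group index {idx} out of range for {compiled.groups} group(s)"
        )

    # Unanchored search for the first match.
    match = compiled.search(subject)
    if match is None:
        return ""  # no match returns the empty string

    # idx=0 is the whole match; an unmatched optional group is None in
    # Python and maps to the empty string.
    group = match.group(idx)
    return group if group is not None else ""
```

For example, `regexp_extract_model("100-200", r"(\d+)-(\d+)")` uses the default `idx` of 1 and yields the first capture group, while `idx=0` yields the whole match.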
## How are these changes tested?
- 9 Rust unit tests in `regexp_extract.rs` covering basic group extraction,
idx=0/default idx, null subject, null pattern, unmatched optional group,
out-of-range idx, negative idx, and invalid regex.
- Two new Comet SQL test files:
  - `regexp_extract.sql` verifies that the expression falls back to Spark by
    default.
  - `regexp_extract_enabled.sql` exercises the happy path with
    `allowIncompatible=true`, including default and explicit idx, idx=0, no-match,
    null input, optional unmatched groups, anchors, and an all-literal expression.
    Run under a `ConfigMatrix` for both dictionary-encoded and plain Parquet input.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]