andygrove commented on code in PR #4461:
URL: https://github.com/apache/datafusion-comet/pull/4461#discussion_r3319479032


##########
docs/source/contributor-guide/spark_expressions_support.md:
##########
@@ -566,33 +635,88 @@
 - [ ] regexp_extract_all
 - [ ] regexp_instr
 - [x] regexp_replace
+  - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+  - Spark 3.5.8 (audited 2026-05-27): baseline. `RegExpReplace(subject, regexp, rep, pos)` with foldable `pos > 0`; uses Java `Pattern`. Comet supports only `pos = 1` (other offsets fall back) and injects a `'g'` flag because DataFusion's `regexp_replace` stops at the first match by default.
+  - Spark 4.0.1 (audited 2026-05-27): adds raw-string literal support at the parser level and `nullIntolerant: Boolean = true`; runtime semantics unchanged.
+  - Known limitation: regex semantics differ (Rust `regex` crate vs Java `Pattern`); `RegExp.isSupportedPattern` currently returns `false` for every pattern, so the path always requires `spark.comet.expression.regexp.allowIncompatible=true`.
 - [ ] regexp_substr
 - [x] repeat
+  - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+  - Spark 3.5.8 (audited 2026-05-27): baseline. `StringRepeat(str, times)` with `nullSafeEval(s, n) = s.repeat(n)`; `UTF8String.repeat` returns the empty string for `n <= 0`. Comet casts `times` to `LongType` and delegates to DataFusion `repeat`.
+  - Spark 4.0.1 (audited 2026-05-27): adds `nullIntolerant: Boolean` field; `dataType` becomes `str.dataType` (collation-tracking). Semantics unchanged for `UTF8_BINARY`.
+  - Known divergence: DataFusion `repeat` throws on negative counts instead of returning the empty string Spark produces. Currently surfaced via `getCompatibleNotes` only (https://github.com/apache/datafusion-comet/issues/4462).
 - [x] replace
+  - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+  - Spark 3.5.8 (audited 2026-05-27): baseline. `StringReplace(src, search, replace)`; when `search` is empty, Spark returns `src` unchanged (short-circuit on `search.numBytes == 0`).
+  - Spark 4.0.1 (audited 2026-05-27): routes through `CollationSupport.StringReplace.exec`; semantics unchanged for `UTF8_BINARY`.
+  - Known divergence: DataFusion `replace` inserts `to` between every character (and at both ends) when `search` is empty (https://github.com/apache/datafusion-comet/issues/3344). Currently the support level is `Compatible`.
 - [x] right
+  - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+  - Spark 3.5.8 (audited 2026-05-27): baseline. `RuntimeReplaceable` with `replacement = If(IsNull(str), null, If(len <= 0, "", Substring(str, -len, len)))`; accepts `StringType` plus `IntegerType`. Comet serde rewrites positive `len` to a `Substring` proto with `start=-len, len=len`; for `len <= 0` it builds an `If(IsNull(str), null, "")` proto chain to preserve NULL propagation.
+  - Spark 4.0.1 (audited 2026-05-27): `inputTypes` widened with collation; uses `UnaryMinus(len, failOnError = false)` to avoid integer-overflow exceptions on `len = Int.MinValue`. Semantics unchanged for `UTF8_BINARY`.
+  - Known limitation: the literal-only `len` restriction is enforced inside `convert` via `withInfo` rather than declared in `getSupportLevel`.

Review Comment:
   We should fix this.
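
For reference, the Spark-side semantics quoted in the hunk above can be modelled in a few lines. This is only a minimal Python sketch, not Comet or Spark code: the function names `spark_right`, `spark_repeat`, and `spark_replace` are hypothetical, and the bodies simply transcribe the behaviour the audit notes describe for `UTF8_BINARY`.

```python
from typing import Optional


def spark_right(s: Optional[str], length: int) -> Optional[str]:
    # Transcribes If(IsNull(str), null, If(len <= 0, "", Substring(str, -len, len))).
    if s is None:
        return None  # NULL input propagates even when len <= 0
    if length <= 0:
        return ""
    return s[-length:]


def spark_repeat(s: Optional[str], times: Optional[int]) -> Optional[str]:
    # nullSafeEval: a NULL argument yields NULL; n <= 0 yields the empty string.
    if s is None or times is None:
        return None
    return s * times if times > 0 else ""


def spark_replace(src: str, search: str, to: str) -> str:
    # Short-circuit described in the notes: an empty search leaves src unchanged.
    if search == "":
        return src
    return src.replace(search, to)


if __name__ == "__main__":
    print(spark_right("datafusion", 6))   # fusion
    print(spark_right(None, 0))           # None
    print(repr(spark_repeat("ab", -1)))   # ''
    print(spark_replace("abc", "", "-"))  # abc
    print("abc".replace("", "-"))         # -a-b-c-
```

The last line calls Python's built-in `str.replace` without the guard. It inserts the replacement at every position, which has the same shape as the DataFusion `replace` divergence described in the notes, so the empty-`search` guard is the piece that carries the Spark behaviour.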



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

