andygrove opened a new pull request, #4567: URL: https://github.com/apache/datafusion-comet/pull/4567
## Which issue does this PR close? Addresses the collation-related items (1, 2, and 3) of #4504. The remaining medium-priority items in that tracking issue (the `array/size.sql` MapType test and the `CometSize` defensive branch) are not collation-related and are left for a separate change, so this PR does not close #4504. ## Rationale for this change Spark 4.0 widened `Concat.allowedTypes` and the string branch of `Reverse.inputTypes` from `StringType` to `StringTypeWithCollation`, and both preserve the collation in their result type. Comet's native `concat`/`reverse` UDFs operate on raw bytes and produce `Utf8` (UTF8_BINARY semantics), so for a non-default (non-UTF8_BINARY) collation the native result silently diverges from Spark. Per the `audit-comet-expression` conventions, such Spark 4.0+ collation gaps must surface as `Incompatible(Some(reason))` so the divergence appears in EXPLAIN and the generated compatibility doc, and so the expression falls back to Spark by default. ## What changes are included in this PR? - **`CometConcat`** (item 1): now reports `Incompatible` when any child or the result carries a non-default collation. The string-input check uses `isInstanceOf[StringType]` instead of `== DataTypes.StringType`, because a collated `StringType` is not equal to the default `StringType` and was otherwise misreported via the unsupported-input-type branch. `getIncompatibleReasons` lists the new reason. - **`CometReverse`** (item 2): the non-array (string) branch now reports `Incompatible` when the child carries a non-default collation. - **`CometReverse.getIncompatibleReasons`** (item 3): now appends the string-collation reason to the array-branch reason so every `Incompatible(Some(r))` branch is represented (reaches EXPLAIN and the compatibility doc). Both reasons are concise and link to the collation tracking issue #2190. ## How are these changes tested? Added `expect_fallback` cases to `spark/src/test/resources/sql-tests/expressions/string/collation.sql` (already gated `MinSparkVersion: 4.0`) asserting that `concat`/`reverse` on `UTF8_LCASE` strings fall back to Spark with the expected reason. Verified on Spark 4.1; confirmed `concat`/`reverse` remain `Compatible` and the existing fixtures still pass on both Spark 3.5 and 4.1, and that the collation file is correctly skipped on 3.5. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
