[PR] fix: mark non-UTF8_BINARY collations as Incompatible for concat and reverse [datafusion-comet]

via GitHub Tue, 02 Jun 2026 08:04:16 -0700


andygrove opened a new pull request, #4567:
URL: https://github.com/apache/datafusion-comet/pull/4567


   ## Which issue does this PR close?
   
   Addresses the collation-related items (1, 2, and 3) of #4504. The remaining 
medium-priority items in that tracking issue (the `array/size.sql` MapType test 
and the `CometSize` defensive branch) are not collation-related and are left 
for a separate change, so this PR does not close #4504.
   
   ## Rationale for this change
   
   Spark 4.0 widened `Concat.allowedTypes` and the string branch of 
`Reverse.inputTypes` from `StringType` to `StringTypeWithCollation`, and both 
preserve the collation in their result type. Comet's native `concat`/`reverse` 
UDFs operate on raw bytes and produce `Utf8` (UTF8_BINARY semantics), so for a 
non-default (non-UTF8_BINARY) collation the native result silently diverges 
from Spark. Per the `audit-comet-expression` conventions, such Spark 4.0+ 
collation gaps must surface as `Incompatible(Some(reason))` so the divergence 
appears in EXPLAIN and the generated compatibility doc, and so the expression 
falls back to Spark by default.
   
   ## What changes are included in this PR?
   
   - **`CometConcat`** (item 1): now reports `Incompatible` when any child or 
the result carries a non-default collation. The string-input check uses 
`isInstanceOf[StringType]` instead of `== DataTypes.StringType`, because a 
collated `StringType` is not equal to the default `StringType` and was 
otherwise misreported via the unsupported-input-type branch. 
`getIncompatibleReasons` lists the new reason.
   - **`CometReverse`** (item 2): the non-array (string) branch now reports 
`Incompatible` when the child carries a non-default collation.
   - **`CometReverse.getIncompatibleReasons`** (item 3): now appends the 
string-collation reason to the array-branch reason so every 
`Incompatible(Some(r))` branch is represented (reaches EXPLAIN and the 
compatibility doc).
   
   Both reasons are concise and link to the collation tracking issue #2190.
   
   ## How are these changes tested?
   
   Added `expect_fallback` cases to 
`spark/src/test/resources/sql-tests/expressions/string/collation.sql` (already 
gated `MinSparkVersion: 4.0`) asserting that `concat`/`reverse` on `UTF8_LCASE` 
strings fall back to Spark with the expected reason. Verified on Spark 4.1; 
confirmed `concat`/`reverse` remain `Compatible` and the existing fixtures 
still pass on both Spark 3.5 and 4.1, and that the collation file is correctly 
skipped on 3.5.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] fix: mark non-UTF8_BINARY collations as Incompatible for concat and reverse [datafusion-comet]

Reply via email to