kumarUjjawal commented on code in PR #21883:
URL: https://github.com/apache/datafusion/pull/21883#discussion_r3151881371


##########
datafusion/expr-common/src/type_coercion/binary.rs:
##########
@@ -1589,9 +1589,18 @@ fn ree_coercion(
 /// This is a union of string coercion rules and specified rules:
 /// 1. At least one side of lhs and rhs should be string type (Utf8 / 
LargeUtf8)
 /// 2. Data type of the other side should be able to cast to string type
+/// 3. Binary and string types cannot be mixed
 fn string_concat_coercion(lhs_type: &DataType, rhs_type: &DataType) -> 
Option<DataType> {

Review Comment:
   Before the change, `Binary || Utf8` returned a `Utf8` string, and `Binary || 
Binary` also returned `Utf8`. Now the first errors out, and the second returns 
`Binary`, wouldn't this be a breaking change for users who are hitting this in 
a query? 



##########
datafusion/expr-common/src/type_coercion/binary/tests/comparison.rs:
##########
@@ -897,3 +897,74 @@ fn test_binary_comparison_string_numeric_coercion() -> 
Result<()> {
     }
     Ok(())
 }
+
+#[test]
+fn test_string_concat_coercion() -> Result<()> {
+    // Binary
+    test_coercion_binary_rule!(
+        DataType::Binary,
+        DataType::Binary,
+        Operator::StringConcat,
+        DataType::Binary
+    );
+    test_coercion_binary_rule!(
+        DataType::LargeBinary,
+        DataType::LargeBinary,
+        Operator::StringConcat,
+        DataType::LargeBinary
+    );
+    test_coercion_binary_rule!(
+        DataType::BinaryView,
+        DataType::BinaryView,
+        Operator::StringConcat,
+        DataType::BinaryView
+    );
+
+    // String
+    test_coercion_binary_rule!(
+        DataType::Utf8,
+        DataType::Utf8,
+        Operator::StringConcat,
+        DataType::Utf8
+    );
+    test_coercion_binary_rule!(
+        DataType::LargeUtf8,
+        DataType::LargeUtf8,
+        Operator::StringConcat,
+        DataType::LargeUtf8
+    );
+    test_coercion_binary_rule!(
+        DataType::Utf8View,
+        DataType::Utf8View,
+        Operator::StringConcat,
+        DataType::Utf8View
+    );
+
+    // Mixed string-binary
+    for dt in [DataType::Utf8, DataType::LargeUtf8, DataType::Utf8View] {
+        assert!(
+            BinaryTypeCoercer::new(&DataType::Binary, &Operator::StringConcat, 
&dt,)
+                .get_input_types()
+                .is_err(),
+            "{}",
+            dt
+        );
+        assert!(
+            BinaryTypeCoercer::new(&dt, &Operator::StringConcat, 
&DataType::Binary,)
+                .get_input_types()
+                .is_err(),
+            "{}",
+            dt
+        );

Review Comment:
   it only loops over string types paired with `DataType::Binary`. The coercion 
rule actually covers 9 combinations (Binary/LargeBinary/BinaryView × 
Utf8/LargeUtf8/Utf8View). Could you extend the loop to cover all of them? 
Something like a nested loop over both lists would do it. Right now LargeBinary 
|| Utf8 and BinaryView || LargeUtf8 are not directly tested. wdyt?



##########
datafusion/sqllogictest/test_files/spark/string/concat.slt:
##########
@@ -135,3 +135,25 @@ query T
 SELECT concat(x'636166c3', x'a968656c6c6f');
 ----
 caféhello
+
+# UDF concatenation for valid UTF-8 arguments
+query T
+SELECT concat(x'636166c3a9', x'68656c6c6f');
+----
+caféhello
+
+# concat UDF allows invalid UTF-8 arguments, so it allows a valid UTF-8 
sequence after concatenation
+query T
+SELECT concat(x'c3', x'a9');
+----
+é
+
+# concat UDF cannot form a valid UTF-8 sequence
+query error Execution error: invalid UTF-8 in binary literal
+SELECT concat(x'ff', x'af');
+
+# Mixed binary and text provide actual UTF-8 sequence, not '\xc3a9' as in 
PostgreSQL
+query T
+SELECT concat(x'c3a9', 'hello');

Review Comment:
   `concat(x'c3a9', 'hello')` works and returns `éhello`,  whereas `x'c3a9' || 
'hello'` would errors,  should we track this in an issue?



##########
datafusion/physical-expr/src/expressions/binary/kernels.rs:
##########
@@ -204,6 +204,95 @@ pub fn concat_elements_utf8view(
     Ok(result.finish())
 }
 
+/// Concatenates two `GenericBinaryArray`s element-wise.
+/// If either element is `Null`, the result element is also `Null`.
+///
+/// # Errors
+/// - Returns an error if the input arrays have different lengths.
+/// - Panics if any concatenated string exceeds `T::Offset::MAX` in length.

Review Comment:
   shoudl we move it to Panic section? also the kernel below 
`concat_elements_binary_view_array` returns an error on overflow  while this 
one panics. Same overflow class, different behavior. Either pick one and apply 
it to both, or add a comment explaining why they differ.



##########
datafusion/physical-expr/src/expressions/binary/kernels.rs:
##########
@@ -204,6 +204,95 @@ pub fn concat_elements_utf8view(
     Ok(result.finish())
 }
 
+/// Concatenates two `GenericBinaryArray`s element-wise.
+/// If either element is `Null`, the result element is also `Null`.
+///
+/// # Errors
+/// - Returns an error if the input arrays have different lengths.
+/// - Panics if any concatenated string exceeds `T::Offset::MAX` in length.
+pub fn concat_elements_binary_array<T: OffsetSizeTrait>(
+    left: &GenericBinaryArray<T>,
+    right: &GenericBinaryArray<T>,
+) -> std::result::Result<GenericBinaryArray<T>, ArrowError> {
+    if left.len() != right.len() {
+        return Err(ArrowError::ComputeError(format!(
+            "Arrays must have the same length: {} != {}",
+            left.len(),
+            right.len()
+        )));
+    }
+    // data capacity is unknown, so pass zero
+    let mut result = GenericBinaryBuilder::<T>::with_capacity(left.len(), 0);
+
+    // Avoid reallocations by writing to a reused buffer (note we could be even
+    // more efficient by creating the view directly here and avoid the buffer
+    // but that would be more complex)
+    let mut buffer = MutableBuffer::new(0);
+
+    // Pre-compute combined null bitmap, so the per-row NULL check is more
+    // efficient
+    let nulls = NullBuffer::union(left.nulls(), right.nulls());
+
+    for i in 0..left.len() {
+        if nulls.as_ref().is_some_and(|n| n.is_null(i)) {
+            result.append_null();
+        } else {
+            let l = left.value(i);
+            let r = right.value(i);
+            buffer.clear();
+            buffer.extend_from_slice(l);
+            buffer.extend_from_slice(r);
+            // No try-version of append_value because it panics on overflow

Review Comment:
   nit: we could improve the comment



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to