theirix commented on code in PR #21883:
URL: https://github.com/apache/datafusion/pull/21883#discussion_r3156330752
##########
datafusion/physical-expr/src/expressions/binary/kernels.rs:
##########
@@ -204,6 +204,95 @@ pub fn concat_elements_utf8view(
Ok(result.finish())
}
+/// Concatenates two `GenericBinaryArray`s element-wise.
+/// If either element is `Null`, the result element is also `Null`.
+///
+/// # Errors
+/// - Returns an error if the input arrays have different lengths.
+/// - Panics if any concatenated string exceeds `T::Offset::MAX` in length.
Review Comment:
This went away
##########
datafusion/physical-expr/src/expressions/binary/kernels.rs:
##########
@@ -204,6 +204,95 @@ pub fn concat_elements_utf8view(
Ok(result.finish())
}
+/// Concatenates two `GenericBinaryArray`s element-wise.
+/// If either element is `Null`, the result element is also `Null`.
+///
+/// # Errors
+/// - Returns an error if the input arrays have different lengths.
+/// - Panics if any concatenated string exceeds `T::Offset::MAX` in length.
+pub fn concat_elements_binary_array<T: OffsetSizeTrait>(
+ left: &GenericBinaryArray<T>,
+ right: &GenericBinaryArray<T>,
+) -> std::result::Result<GenericBinaryArray<T>, ArrowError> {
+ if left.len() != right.len() {
+ return Err(ArrowError::ComputeError(format!(
+ "Arrays must have the same length: {} != {}",
+ left.len(),
+ right.len()
+ )));
+ }
+ // data capacity is unknown, so pass zero
+ let mut result = GenericBinaryBuilder::<T>::with_capacity(left.len(), 0);
+
+ // Avoid reallocations by writing to a reused buffer (note we could be even
+ // more efficient by creating the view directly here and avoid the buffer
+ // but that would be more complex)
+ let mut buffer = MutableBuffer::new(0);
+
+ // Pre-compute combined null bitmap, so the per-row NULL check is more
+ // efficient
+ let nulls = NullBuffer::union(left.nulls(), right.nulls());
+
+ for i in 0..left.len() {
+ if nulls.as_ref().is_some_and(|n| n.is_null(i)) {
+ result.append_null();
+ } else {
+ let l = left.value(i);
+ let r = right.value(i);
+ buffer.clear();
+ buffer.extend_from_slice(l);
+ buffer.extend_from_slice(r);
+ // No try-version of append_value because it panics on overflow
Review Comment:
This went away
##########
datafusion/expr-common/src/type_coercion/binary.rs:
##########
@@ -1589,9 +1589,18 @@ fn ree_coercion(
/// This is a union of string coercion rules and specified rules:
/// 1. At least one side of lhs and rhs should be string type (Utf8 /
LargeUtf8)
/// 2. Data type of the other side should be able to cast to string type
+/// 3. Binary and string types cannot be mixed
fn string_concat_coercion(lhs_type: &DataType, rhs_type: &DataType) ->
Option<DataType> {
Review Comment:
It is worth noting, as we decided to ban mixing according to [this
analysis](https://github.com/apache/datafusion/issues/12709#issuecomment-4194430343).
I'll add `api change` label and put an upgrade notice shortly.
I intend to put a separate PR to harmonise `concat` UDF behaviour (it allows
mixing) with the pipe operator's behaviour to avoid confusion
##########
datafusion/physical-expr/src/expressions/binary/kernels.rs:
##########
@@ -204,6 +204,95 @@ pub fn concat_elements_utf8view(
Ok(result.finish())
}
+/// Concatenates two `GenericBinaryArray`s element-wise.
+/// If either element is `Null`, the result element is also `Null`.
+///
+/// # Errors
+/// - Returns an error if the input arrays have different lengths.
+/// - Panics if any concatenated string exceeds `T::Offset::MAX` in length.
+pub fn concat_elements_binary_array<T: OffsetSizeTrait>(
Review Comment:
Thanks for pointing that out. I am reinventing the wheel. Changed to these
implementations, leaving only view concats. Is there any reason we don't have
StringViewArray/BinaryViewArray/GenericByteViewArray in arrow-string? PR if so?
##########
datafusion/sqllogictest/test_files/spark/string/concat.slt:
##########
@@ -135,3 +135,25 @@ query T
SELECT concat(x'636166c3', x'a968656c6c6f');
----
caféhello
+
+# UDF concatenation for valid UTF-8 arguments
+query T
+SELECT concat(x'636166c3a9', x'68656c6c6f');
+----
+caféhello
+
+# concat UDF allows invalid UTF-8 arguments, so it allows a valid UTF-8
sequence after concatenation
+query T
+SELECT concat(x'c3', x'a9');
+----
+é
+
+# concat UDF cannot form a valid UTF-8 sequence
+query error Execution error: invalid UTF-8 in binary literal
+SELECT concat(x'ff', x'af');
+
+# Mixed binary and text provide actual UTF-8 sequence, not '\xc3a9' as in
PostgreSQL
+query T
+SELECT concat(x'c3a9', 'hello');
Review Comment:
`concat` is more capable as in #20787. As per the other comment, `concat`
could be harmonised.
##########
datafusion/expr-common/src/type_coercion/binary/tests/comparison.rs:
##########
@@ -897,3 +897,74 @@ fn test_binary_comparison_string_numeric_coercion() ->
Result<()> {
}
Ok(())
}
+
+#[test]
+fn test_string_concat_coercion() -> Result<()> {
+ // Binary
+ test_coercion_binary_rule!(
+ DataType::Binary,
+ DataType::Binary,
+ Operator::StringConcat,
+ DataType::Binary
+ );
+ test_coercion_binary_rule!(
+ DataType::LargeBinary,
+ DataType::LargeBinary,
+ Operator::StringConcat,
+ DataType::LargeBinary
+ );
+ test_coercion_binary_rule!(
+ DataType::BinaryView,
+ DataType::BinaryView,
+ Operator::StringConcat,
+ DataType::BinaryView
+ );
+
+ // String
+ test_coercion_binary_rule!(
+ DataType::Utf8,
+ DataType::Utf8,
+ Operator::StringConcat,
+ DataType::Utf8
+ );
+ test_coercion_binary_rule!(
+ DataType::LargeUtf8,
+ DataType::LargeUtf8,
+ Operator::StringConcat,
+ DataType::LargeUtf8
+ );
+ test_coercion_binary_rule!(
+ DataType::Utf8View,
+ DataType::Utf8View,
+ Operator::StringConcat,
+ DataType::Utf8View
+ );
+
+ // Mixed string-binary
+ for dt in [DataType::Utf8, DataType::LargeUtf8, DataType::Utf8View] {
+ assert!(
+ BinaryTypeCoercer::new(&DataType::Binary, &Operator::StringConcat,
&dt,)
+ .get_input_types()
+ .is_err(),
+ "{}",
+ dt
+ );
+ assert!(
+ BinaryTypeCoercer::new(&dt, &Operator::StringConcat,
&DataType::Binary,)
+ .get_input_types()
+ .is_err(),
+ "{}",
+ dt
+ );
Review Comment:
Agree. By the way, it tests only coercion, the actual operation concat, as
Jeffrey spotted for LargeBinary, is done via slt
##########
datafusion/sqllogictest/test_files/spark/string/concat.slt:
##########
@@ -135,3 +135,25 @@ query T
SELECT concat(x'636166c3', x'a968656c6c6f');
----
caféhello
+
+# UDF concatenation for valid UTF-8 arguments
Review Comment:
My mistake, it was added to this branch as a part of another PR to showcase
the `concat` behaviour. Removing
##########
datafusion/sqllogictest/test_files/binary.slt:
##########
@@ -321,3 +321,20 @@ query T
SELECT split_part(CAST(binary AS VARCHAR), 'o', 2) FROM t WHERE binary =
X'466f6f';
----
(empty)
+
+# Pipe concatenation of binaries always provides a binary
+query ?
+SELECT x'636166c3a9' || x'68656c6c6f';
+----
+636166c3a968656c6c6f
+
+# Byte pipe operator is forbidden for mixed binary and text
Review Comment:
No, I slipped it, thanks!. Added unit and SLT tests
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]