neilconway commented on code in PR #21877:
URL: https://github.com/apache/datafusion/pull/21877#discussion_r3149575463
##########
datafusion/functions/src/unicode/substrindex.rs:
##########
@@ -261,30 +261,37 @@ fn visible_string_bytes<T: OffsetSizeTrait>(
offsets[offsets.len() - 1].as_usize() - offsets[0].as_usize()
}
-fn substr_index_general<'a, S, B>(
+fn substr_index_general<'a, S, O>(
string_array: S,
delimiter_array: S,
count_array: &PrimitiveArray<Int64Type>,
- mut builder: B,
+ mut builder: GenericStringArrayBuilder<O>,
) -> Result<ArrayRef>
where
S: StringArrayType<'a> + Copy,
- B: StringLikeArrayBuilder,
+ O: OffsetSizeTrait,
{
- for ((string, delimiter), n) in string_array
- .iter()
- .zip(delimiter_array.iter())
- .zip(count_array.iter())
- {
- match (string, delimiter, n) {
- (Some(string), Some(delimiter), Some(n)) => {
- builder.append_value(substr_index_slice(string, delimiter, n));
- }
- _ => builder.append_null(),
+ let num_rows = string_array.len();
+ // Output is null IFF any input is null.
+ let nulls = NullBuffer::union(
Review Comment:
Good question, @RatulDawar !
Empirically, taking this approach appears to be a significant performance
win (although in some cases the overhead of NULL handling might not be huge to
begin with, depending on the UDF). Intuitively, it makes sense that this would
be faster:
* Computing the union of two bitmaps can be accelerated with SIMD, and the
Arrow primitives are well-optimized. It's also an entirely sequential memory
access pattern.
* `append_value` in the bulk-NULL builder does not need to touch the NULL
bitmap for every row, whereas in the Arrow builder it does. That means fewer
per-row branches and also less data cache pressure.
* Similarly, we can now just check a single bit to look for NULLs, rather
than doing three conditionals. It's also easy for the compiler to hoist the
is-the-NULL-bitmap-None check outside of the loop and elide it entirely (I
haven't checked if LLVM is actually doing this but it could).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]