arashandishgar opened a new issue, #46128: URL: https://github.com/apache/arrow/issues/46128
### Describe the enhancement requested The current compute function for casting from other String Types to StringView Types can be optimized in five field. I'm going to elaborate on each of them. <details> <summary>1-Slicing </summary> Running the following code ``` #define LOG(...) ARROW_LOGGER_INFO("", __VA_ARGS__) template <typename I, typename O> void CheckCast(int string_length, int offset, int offset_lengt) { using Builder = typename TypeTraits<I>::BuilderType; using ArrayType = typename TypeTraits<O>::ArrayType; Builder builder; for (int i = 0; i < string_length; i++) { if (i % 3 == 0) { ASSERT_OK(builder.AppendNull()); } else { ASSERT_OK(builder.Append(std::string(4, 'a'))); } } std::shared_ptr<Array> input; ASSERT_OK(builder.Finish(&input)); CastOptions options; options.to_type = TypeTraits<O>::type_singleton(); ASSERT_OK_AND_ASSIGN( auto datum, CallFunction("cast", {input->Slice(offset, offset_lengt)}, &options)); auto output = datum.array_as<ArrayType>(); auto data = output->data(); LOG("Size:", data->buffers[1]->size()); LOG("Offset:", data->offset); LOG("Length:", data->length); } TEST(Test, CastStringToStringView) { CheckCast<StringType, StringViewType>((1<<10)+1, 1<<10, 1); } ``` will print the following similar massage ``` /home/arashandishgar/Desktop/arrow/cpp/src/arrow/compute/kernels/scalar_cast_test.cc:3289: Size:16400 /home/arashandishgar/Desktop/arrow/cpp/src/arrow/compute/kernels/scalar_cast_test.cc:3290: Offset:1024 /home/arashandishgar/Desktop/arrow/cpp/src/arrow/compute/kernels/scalar_cast_test.cc:3291: Length:1 ``` In other words, the compute function allocates 1024 elements for the view buffer for only one element. ( I think the main focus of the author of this function was to prevent copying of bitmap, However, in the slicing case it's better the bitmap buffer is copied) I've already changed the function to prevent extra allocation and I'd like to send a pull request to solve it. (the same scenario exists for casting from FixedSizeBinary to String/Binary View too). </details> <details> <summary>2-Checking Overflow</summary> The current implementation for casting to string view check overflowing after allocating buffers https://github.com/apache/arrow/blob/efaef48db06767d40dc0c08998a8b3fd25f2cbeb/cpp/src/arrow/compute/kernels/scalar_cast_string.cc#L397-L412 However, it's possible to check for overflow before validating UTF-8, as I believe there's no need to allocate a buffer if an overflow occurs. Additionally, there's no overflow check when casting from StringView types to String types, which could allow a UTF-8 string to exceed the maximum int32_t length. https://github.com/apache/arrow/blob/efaef48db06767d40dc0c08998a8b3fd25f2cbeb/cpp/src/arrow/compute/kernels/scalar_cast_string.cc#L344-L346 For example ``` #define LOG(...) ARROW_LOGGER_INFO("", __VA_ARGS__) template <typename I, typename O> void CheckCast(int string_length, int string_width,int offset, int offset_lengt) { using Builder = typename TypeTraits<I>::BuilderType; using ArrayType = typename TypeTraits<O>::ArrayType; Builder builder; for (int i = 0; i < string_length; i++) { ASSERT_OK(builder.Append(std::string(string_width, 'a'))); } std::shared_ptr<Array> input; ASSERT_OK(builder.Finish(&input)); CastOptions options; options.to_type = TypeTraits<O>::type_singleton(); ASSERT_OK_AND_ASSIGN( auto datum, CallFunction("cast", {input->Slice(offset, offset_lengt)}, &options)); auto output = datum.array_as<ArrayType>(); auto data = output->data(); LOG("Size:", data->buffers[1]->size()); LOG("Size:", data->buffers[2]->size()); LOG("Offset:", data->offset); LOG("Length:", data->length); } TEST(Test, CastStringToStringView) { CheckCast<StringViewType, StringType>(8, 1<<30,0,1<<30); } ``` and the massage is ``` /home/arashandishgar/Desktop/arrow/cpp/src/arrow/compute/kernels/scalar_cast_test.cc:3285: Size:36 /home/arashandishgar/Desktop/arrow/cpp/src/arrow/compute/kernels/scalar_cast_test.cc:3286: Size:8589934592 /home/arashandishgar/Desktop/arrow/cpp/src/arrow/compute/kernels/scalar_cast_test.cc:3287: Offset:0 /home/arashandishgar/Desktop/arrow/cpp/src/arrow/compute/kernels/scalar_cast_test.cc:3288: Length:8 ``` </details> <details> <summary>3-Large Strings</summary> I'm not sure whether the following behaviour should be called bug(Although the code mentioned in its commeent as unpractical case) https://github.com/apache/arrow/blob/efaef48db06767d40dc0c08998a8b3fd25f2cbeb/cpp/src/arrow/compute/kernels/scalar_cast_string.cc#L397-L411 The above logic prevents casting scenarios, such as 8 GB of data in a LargeString type (where each element's length is less than the maximum int32_t), to a StringView. I believe a simple implementation could be feasible, though I recognize it would overlook the key benefit of StringView types, which is to create a view of a data buffer while avoiding large data copies. ``` template <typename O, typename I> enable_if_t<is_base_binary_type<I>::value && is_binary_view_like_type<O>::value, Result<std::shared_ptr<ArrayData>>> AddStringToStringView(const ArraySpan& input) { using Builder = typename TypeTraits<O>::BuilderType; Builder builder; ARROW_RETURN_NOT_OK(VisitArraySpanInline<I>( input, [&](std::string_view s) { return builder.Append(s); }, [&]() { return builder.AppendNull(); })); std::shared_ptr<Array> array_out; RETURN_NOT_OK(builder.Finish(&array_out)); return array_out->data(); } ``` </details> <details> <summary>4-Memory Bloat</summary> https://github.com/apache/arrow/blob/efaef48db06767d40dc0c08998a8b3fd25f2cbeb/cpp/src/arrow/compute/kernels/scalar_cast_string.cc#L418-L441 The implementation above may lead to a state where it owns a buffer but utilizes only a portion of it. Is it worthwhile to create a garbage collector and transfer the data to a new buffer? </details> <details> <summary>5-dedublicate</summary> As mentioned in [1] (Section 4), a key feature of the String/Binary View is preventing duplication. I believe this approach could be useful when converting from Fixed to String/Binary View types to minimize memory usage. </details> 1- https://www.influxdata.com/blog/faster-queries-with-stringview-part-two-influxdb/ ### Component(s) C++ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org