arashandishgar opened a new issue, #46128:
URL: https://github.com/apache/arrow/issues/46128

   ### Describe the enhancement requested
   
   The current compute function for casting from other String Types to 
StringView Types can be optimized in five field. I'm going to elaborate on each 
of them.
   
   <details>
     <summary>1-Slicing </summary>
   Running the following code 
   
   ```
   #define LOG(...) ARROW_LOGGER_INFO("", __VA_ARGS__)
   template <typename I, typename O>
   void CheckCast(int string_length, int offset, int offset_lengt) {
     using Builder = typename TypeTraits<I>::BuilderType;
     using ArrayType = typename TypeTraits<O>::ArrayType;
     Builder builder;
     for (int i = 0; i < string_length; i++) {
       if (i % 3 == 0) {
         ASSERT_OK(builder.AppendNull());
       } else {
         ASSERT_OK(builder.Append(std::string(4, 'a')));
       }
     }
   
     std::shared_ptr<Array> input;
     ASSERT_OK(builder.Finish(&input));
     CastOptions options;
     options.to_type = TypeTraits<O>::type_singleton();
     ASSERT_OK_AND_ASSIGN(
         auto datum, CallFunction("cast", {input->Slice(offset, offset_lengt)}, 
&options));
     auto output = datum.array_as<ArrayType>();
     auto data = output->data();
     LOG("Size:", data->buffers[1]->size());
     LOG("Offset:", data->offset);
     LOG("Length:", data->length);
   }
   
   TEST(Test, CastStringToStringView) {
     CheckCast<StringType, StringViewType>((1<<10)+1, 1<<10, 1);
   }
   ```
   will print the following similar
    massage  
   ```
   
   
/home/arashandishgar/Desktop/arrow/cpp/src/arrow/compute/kernels/scalar_cast_test.cc:3289:
 Size:16400
   
/home/arashandishgar/Desktop/arrow/cpp/src/arrow/compute/kernels/scalar_cast_test.cc:3290:
 Offset:1024
   
/home/arashandishgar/Desktop/arrow/cpp/src/arrow/compute/kernels/scalar_cast_test.cc:3291:
 Length:1
   ```
   In other words, the compute function allocates 1024 elements for the view 
buffer for only one element. ( I think the main focus of the author of this 
function was to prevent copying of bitmap, However, in the slicing case it's 
better the bitmap buffer is copied)
   I've already changed the function to prevent extra allocation and I'd like 
to send a pull request to solve it. (the same scenario exists for casting from 
FixedSizeBinary to String/Binary View too).
   
   </details>
   <details>
     <summary>2-Checking Overflow</summary>
   
   The current implementation for casting to string view check overflowing 
after allocating  buffers
   
https://github.com/apache/arrow/blob/efaef48db06767d40dc0c08998a8b3fd25f2cbeb/cpp/src/arrow/compute/kernels/scalar_cast_string.cc#L397-L412
   
   However, it's possible to check for overflow before validating UTF-8, as I 
believe there's no need to allocate a buffer if an overflow occurs. 
Additionally, there's no overflow check when casting from StringView types to 
String types, which could allow a UTF-8 string to exceed the maximum int32_t 
length.
   
   
https://github.com/apache/arrow/blob/efaef48db06767d40dc0c08998a8b3fd25f2cbeb/cpp/src/arrow/compute/kernels/scalar_cast_string.cc#L344-L346
   
   For example
   ```
   #define LOG(...) ARROW_LOGGER_INFO("", __VA_ARGS__)
   template <typename I, typename O>
   void CheckCast(int string_length, int string_width,int offset, int 
offset_lengt) {
     using Builder = typename TypeTraits<I>::BuilderType;
     using ArrayType = typename TypeTraits<O>::ArrayType;
     Builder builder;
     for (int i = 0; i < string_length; i++) {
         ASSERT_OK(builder.Append(std::string(string_width, 'a')));
     }
   
     std::shared_ptr<Array> input;
     ASSERT_OK(builder.Finish(&input));
     CastOptions options;
     options.to_type = TypeTraits<O>::type_singleton();
     ASSERT_OK_AND_ASSIGN(
         auto datum, CallFunction("cast", {input->Slice(offset, offset_lengt)}, 
&options));
     auto output = datum.array_as<ArrayType>();
     auto data = output->data();
     LOG("Size:", data->buffers[1]->size());
     LOG("Size:", data->buffers[2]->size());
     LOG("Offset:", data->offset);
     LOG("Length:", data->length);
   }
   
   TEST(Test, CastStringToStringView) {
     CheckCast<StringViewType, StringType>(8, 1<<30,0,1<<30);
   }
   ```
   and the massage is
   ```
   
   
/home/arashandishgar/Desktop/arrow/cpp/src/arrow/compute/kernels/scalar_cast_test.cc:3285:
 Size:36
   
/home/arashandishgar/Desktop/arrow/cpp/src/arrow/compute/kernels/scalar_cast_test.cc:3286:
 Size:8589934592
   
/home/arashandishgar/Desktop/arrow/cpp/src/arrow/compute/kernels/scalar_cast_test.cc:3287:
 Offset:0
   
/home/arashandishgar/Desktop/arrow/cpp/src/arrow/compute/kernels/scalar_cast_test.cc:3288:
 Length:8
   ```
   </details>
   
   
   <details>
     <summary>3-Large Strings</summary>
   I'm not sure whether the following behaviour should be called bug(Although 
the code mentioned in its commeent as unpractical case) 
   
   
https://github.com/apache/arrow/blob/efaef48db06767d40dc0c08998a8b3fd25f2cbeb/cpp/src/arrow/compute/kernels/scalar_cast_string.cc#L397-L411
   
   The above logic prevents casting scenarios, such as 8 GB of data in a 
LargeString type (where each element's length is less than the maximum 
int32_t), to a StringView. I believe a simple implementation could be feasible, 
though I recognize it would overlook the key benefit of StringView types, which 
is to create a view of a data buffer while avoiding large data copies.
     
   ```
   template <typename O, typename I>
   enable_if_t<is_base_binary_type<I>::value && 
is_binary_view_like_type<O>::value,
               Result<std::shared_ptr<ArrayData>>>
   AddStringToStringView(const ArraySpan& input) {
     using Builder = typename TypeTraits<O>::BuilderType;
     Builder builder;
     ARROW_RETURN_NOT_OK(VisitArraySpanInline<I>(
         input, [&](std::string_view s) { return builder.Append(s); },
         [&]() { return builder.AppendNull(); }));
     std::shared_ptr<Array> array_out;
     RETURN_NOT_OK(builder.Finish(&array_out));
     return array_out->data();
   }
   
   ```
   </details>
   
   <details>
     <summary>4-Memory Bloat</summary>
   
   
https://github.com/apache/arrow/blob/efaef48db06767d40dc0c08998a8b3fd25f2cbeb/cpp/src/arrow/compute/kernels/scalar_cast_string.cc#L418-L441
   
   The implementation above may lead to a state where it owns a buffer but 
utilizes only a portion of it. Is it worthwhile to create a garbage collector 
and transfer the data to a new buffer?
   
   </details>
   
   
   <details>
     <summary>5-dedublicate</summary>
   
   As mentioned in [1] (Section 4), a key feature of the String/Binary View is 
preventing duplication. I believe this approach could be useful when converting 
from Fixed to String/Binary View types to minimize memory usage.
   
   </details>
   
   
   1- 
https://www.influxdata.com/blog/faster-queries-with-stringview-part-two-influxdb/
   
   
   
   
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to