rtbs-dev opened a new issue, #44615:
URL: https://github.com/apache/arrow/issues/44615

   ### Describe the enhancement requested
   
   I'm coming from AwkwardArray and Polars, and I'm trying to vectorize the 
equivalent of finding the byte offsets (or character spans) of all regex 
matches in an array of strings. 
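
To make the desired result concrete, here is a minimal sketch of the semantics in plain Python, using only the standard-library `re` module rather than Arrow or re2. The function name `regex_byte_spans` is made up for illustration, and matching against UTF-8 encoded bytes is an assumption chosen so that the reported offsets are byte offsets rather than character offsets.

```python
import re
from typing import List, Optional, Tuple


def regex_byte_spans(
    values: List[Optional[str]], pattern: str
) -> List[Optional[List[Tuple[int, int]]]]:
    """Return (start, end) byte offsets of every match, per input string.

    Hypothetical helper: it only illustrates the output shape being asked
    for and is not part of Arrow, Polars, or AwkwardArray.
    """
    # Compile a bytes pattern so that match offsets index into UTF-8 bytes.
    compiled = re.compile(pattern.encode("utf-8"))
    result = []
    for value in values:
        if value is None:
            # Propagate nulls, as an array kernel presumably would.
            result.append(None)
            continue
        data = value.encode("utf-8")
        # A single scan yields each match together with its offsets.
        result.append([m.span() for m in compiled.finditer(data)])
    return result


spans = regex_byte_spans(["a1b22", None, "é333"], r"\d+")
```

In the last input, `é` occupies two bytes in UTF-8, so the byte span of `333` starts at 2 even though its character index is 1.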
   
   See [this 
discussion](https://stackoverflow.com/questions/11918314/how-do-i-find-the-offset-of-a-matching-string-using-re2)
 for how to get match offsets in re2 directly. Per the solution there, it seems 
the information would be contained in the `re2::StringPiece` data, which [this 
thread](https://github.com/google/re2/issues/394#issuecomment-1290946763) 
indicates is preferable anyway, because it avoids duplicating memory. I see 
something vaguely related brought up 
[here](https://github.com/apache/arrow/issues/15381), where `string_view` was 
vendored instead, though I don't currently see a way to access the view objects 
through the results of `extract_regex`. 
   
   I do see that the [struct being 
returned](https://arrow.apache.org/docs/cpp/compute.html#string-component-extraction)
 is not a plain string, but adding span locations might break downstream 
users' type definitions or API contracts. Maybe the new behavior could be added 
as an additional option? Alternatively, a new function `extract_regex_spans` 
would already make my life much easier, even if downstream libraries like Polars 
and AwkwardArray have to add new wrapper APIs for their code to support the behavior. 
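
As a rough illustration of what such a function could return, the sketch below mimics it in plain Python with the standard-library `re` module. Everything here is an assumption made for illustration: `extract_regex_spans` does not exist in Arrow, and the first-match-only behavior, the per-group layout (`text`, `start`, `end`), and the use of byte offsets are just one possible design.

```python
import re
from typing import Dict, List, Optional


def extract_regex_spans(
    values: List[Optional[str]], pattern: str
) -> List[Optional[Dict[str, Optional[dict]]]]:
    """Hypothetical sketch: one record per input, one entry per named group.

    Each entry carries the captured text plus its byte offsets, taken from
    the same match object, so no second scan of the string is needed.
    """
    compiled = re.compile(pattern.encode("utf-8"))
    group_names = list(compiled.groupindex)
    rows = []
    for value in values:
        match = None if value is None else compiled.search(value.encode("utf-8"))
        if match is None:
            # Null input or no match: emit a null record (assumed behavior).
            rows.append(None)
            continue
        row = {}
        for name in group_names:
            captured = match.group(name)
            if captured is None:
                # Optional group that did not participate in the match.
                row[name] = None
            else:
                row[name] = {
                    "text": captured.decode("utf-8"),
                    "start": match.start(name),
                    "end": match.end(name),
                }
        rows.append(row)
    return rows


rows = extract_regex_spans(["key=val", "nothing"], r"(?P<k>\w+)=(?P<v>\w+)")
```

Keeping the spans in a separate function, as in this sketch, would leave the existing return type of `extract_regex` untouched for current users.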
   
   Am I missing something obvious? Most importantly, I want to avoid having to 
loop twice over every string (first to find the matching string, and then to 
find the location of that match), because that feels wasteful when the 
matches are discovered via their offset locations in the first place, right? 
   
   Thanks! 
   
   ### Component(s)
   
   C++

