stellarpower opened a new issue, #44608:
URL: https://github.com/apache/arrow/issues/44608

   ### Describe the enhancement requested
   
   Hi,
   
   Somewhat new to Arrow - I've used the basics briefly before, and am aware it underpins many other tools I have used, but I've only needed to get my feet a little wet in the actual API so far.
   
   I was taking some data from a file and constructing a filter on it, and then I wanted to transform one of the columns by calling a function with the value for each row. I know there are Compute functions, and have also learned a bit about Gandiva - but I'm somewhat surprised that, after a few hours of googling, I don't seem to have found a straightforward way of applying my own callable (i.e. a std::function, or similar) in the filtering pipeline, without going through a relatively lengthy process of registering it with the compute function registry and specifying a lot of boilerplate. Maybe this exists and I could be pointed in the right direction, but so far I haven't seen anything indicating it is currently possible. From what I can gather, the R and Python packages do allow use of something like a lambda, albeit in the local language, in their filter pipelines, but for C++ I'm not aware of a way to do this.
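
   For concreteness, this is roughly the registration ceremony I'm referring to, as far as I can piece it together from the compute documentation - only a sketch, with `is_short_string` and its kernel invented for illustration, and I may well have some details wrong:

   ```C++
   #include <arrow/api.h>
   #include <arrow/compute/api.h>
   #include <arrow/util/bit_util.h>

   namespace cp = arrow::compute;

   // The kernel exec is a plain function pointer (cp::ArrayKernelExec), so it
   // cannot capture state the way a lambda with captures could.
   arrow::Status IsShortStringExec(cp::KernelContext* ctx, const cp::ExecSpan& batch,
                                   cp::ExecResult* out) {
     const arrow::ArraySpan& input = batch[0].array;
     arrow::ArraySpan* output = out->array_span_mutable();
     // Offsets buffer of the utf8 input: row i spans offsets[i] to offsets[i + 1].
     const int32_t* offsets = input.GetValues<int32_t>(1);
     // Preallocated boolean output bitmap; validity is handled by the framework.
     uint8_t* out_bits = output->buffers[1].data;
     for (int64_t i = 0; i < input.length; ++i) {
       const bool keep = (offsets[i + 1] - offsets[i]) < 5;
       arrow::bit_util::SetBitTo(out_bits, output->offset + i, keep);
     }
     return arrow::Status::OK();
   }

   arrow::Status RegisterIsShortString() {
     auto func = std::make_shared<cp::ScalarFunction>(
         "is_short_string", cp::Arity::Unary(),
         cp::FunctionDoc("Is the string short?",
                         "Returns true when the string has fewer than 5 bytes.",
                         {"arg"}));
     cp::ScalarKernel kernel({arrow::utf8()}, arrow::boolean(), IsShortStringExec);
     ARROW_RETURN_NOT_OK(func->AddKernel(std::move(kernel)));
     return cp::GetFunctionRegistry()->AddFunction(std::move(func));
   }
   ```

   after which it becomes usable in an expression as `cp::call("is_short_string", {cp::field_ref("b")})` - a lot of ceremony for what is conceptually a one-line predicate.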
   
   In my case, taking the [example code for 
filtering](https://arrow.apache.org/docs/cpp/dataset.html#filtering-data):
   ```C++
   // ... Open a dataset here.
   
   // Read specified columns with a row filter
   ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
   ARROW_RETURN_NOT_OK(scan_builder->Project({"b"}));
   
   ARROW_RETURN_NOT_OK(scan_builder->Filter(
       cp::less(  cp::field_ref("b"), cp::literal(4)  )
   ));
   
   ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
   
   return scanner->ToTable();
   ```
   
   I'd like to be able to pass in some sort of native C++ callable, in relatively few lines, and have it called whilst iterating over the data:
   
   ```C++
   // ... Open a dataset here.
   
   // Read specified columns with a row filter
   ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
   ARROW_RETURN_NOT_OK(scan_builder->Project({"b"}));
   
   ARROW_RETURN_NOT_OK(scan_builder->Filter(
       cp::makeFunction(  // hypothetical API
           [&]<typename Scalar>(const Scalar& cellValue){
               return someComplicatedObject->someComplicatedFunction("hello", "world", cellValue);
           }
       )
   ));
   
   ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
   
   return scanner->ToTable();
   ```
   
   as just one hypothetical example of what this might look like.
   
   In my case the function is stateful - it's not a pure function - so it couldn't be built from a combination of primitives in the compute library; and whilst Gandiva is possible in principle, I don't think it will work that easily, as exposing a C API would make things rather ugly. Also, for this use case, whilst I like the idea of making a kernel through IR lowering, it's overkill for what I need, and I'd happily forgo it in this scenario just for the ease of giving Arrow a native callable without having to generate a lot of boilerplate myself.
   
   I expect managing parallelism, as well as the way types are handled, would be a significant sticking point here - I don't yet know much about how Arrow handles these: coercion, higher-level concepts like "numeric" as opposed to concrete types like doubles or size_ts, etc. - but also on this point I think:
   
   - the template system has the capacity to generate a lot of boilerplate if needed, so it could be that the registration machinery still has to exist, but it can be generated for me (see the sketch after this list);
   - or I could specify some kind of tuple listing the Arrow datatypes the function can support, and Arrow would perform the checks and then turn around and call the function with the decltype of the callable;
   - I'd be happy with a runtime exception - or even a segfault, to be honest - if something goes wrong between the dataset's schema and my function; and
   - as it's being used in a specific scenario and I'm specifying a function directly, I don't believe it needs to be that universal.
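
   As a sketch of that first bullet - not a real Arrow API, just an illustration of how much of the type mapping the template system could already derive for you, here using Arrow's existing CTypeTraits:

   ```C++
   #include <arrow/type.h>
   #include <arrow/type_traits.h>

   #include <memory>
   #include <vector>

   // Hypothetical helper: derive the Arrow DataTypes for a kernel signature
   // from the C++ parameter types of the user's callable, via Arrow's
   // CTypeTraits mapping (double -> float64, std::string -> utf8, ...).
   template <typename... Args>
   std::vector<std::shared_ptr<arrow::DataType>> ArrowTypesFor() {
     return {arrow::CTypeTraits<Args>::type_singleton()...};
   }

   // e.g. ArrowTypesFor<std::string, double>() yields {utf8(), float64()}
   ```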
   
   
   For example, because this is quite specific to the data I'm operating on - its datatypes and its structure - my gut feeling is that it would be fine for the function to work in more limited scenarios: whilst the onus of managing the interface could be placed on Arrow, the onus of specifying the datatypes correctly, or of carving out the scenario under which this is well-formed, can justifiably fall to the user. I think that matches what a lambda or custom function call implies - if it's not generic, but rather specific or custom logic the user needs to perform, then it's less of an issue if it's somewhat tied to how the datatypes are handled, and would error out or blow up if the actual data that come in aren't what was specified. My function only works on strings; it would already be a problem if the column I tried to call it on didn't contain strings, or were missing. So, in my mind, limiting the scope - either assuming the data will be a string, or only supporting a string - is preferable to having to register a custom function with a registry and specify one or several different ways in which it could work, which would be overkill for my use case.
   
   Similarly on parallelism: if my function is thread-safe, then I'd happily specify that myself, and if not, indicate that the data will need to be iterated on at most one core for this part of the filter.
   
   My current way of doing this is to filter out some columns I don't need from the dataset, then iterate over every row in batches, building an array of boolean flags to indicate whether each row is kept, then create a new table from new record batches with that flag column appended, filter it again to remove the discarded rows, and finally project to remove that column as well (roughly like the sketch below). This is a row filter, but I think the same idea holds for something like a map - say we wanted to take a numerical column with the brightness of an observed star and run some non-trivial calculation to estimate its mass, or how far it might be from Earth. The dataset filtering code in the example is really nice and terse, and also quite readable: I let Arrow work out how to perform the actual implementation and just declare what I want to do in the data pipeline. What I have is a lot more verbose and isn't ideal from that perspective. And I'm not using larger-than-memory data, so it's not an issue, but I expect it could be - or at least would have to be planned more carefully - when performing a filter on enormous datasets.
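
   Simplified, the workaround looks something like the following - a sketch where `KeepRow` stands in for my stateful callable, and where I build the mask and call `Filter` directly rather than appending and re-projecting the flag column:

   ```C++
   #include <arrow/api.h>
   #include <arrow/compute/api.h>

   namespace cp = arrow::compute;

   // Stand-in for someComplicatedObject->someComplicatedFunction(...): any
   // stateful native predicate over the cell value.
   bool KeepRow(const std::string& value) { return value.size() < 5; }

   arrow::Result<std::shared_ptr<arrow::Table>> FilterWithCallable(
       const std::shared_ptr<arrow::Table>& table) {
     auto column = table->GetColumnByName("b");  // assumed to be utf8
     if (column == nullptr) return arrow::Status::Invalid("no column 'b'");

     // Walk the column chunk by chunk, calling the native predicate per row
     // to build a boolean mask (null cells are simply dropped here).
     arrow::BooleanBuilder builder;
     for (const auto& chunk : column->chunks()) {
       const auto& strings = static_cast<const arrow::StringArray&>(*chunk);
       for (int64_t i = 0; i < strings.length(); ++i) {
         ARROW_RETURN_NOT_OK(
             builder.Append(!strings.IsNull(i) && KeepRow(strings.GetString(i))));
       }
     }
     ARROW_ASSIGN_OR_RAISE(auto mask, builder.Finish());

     // Hand the actual row selection back to the compute library.
     ARROW_ASSIGN_OR_RAISE(arrow::Datum filtered, cp::Filter(table, mask));
     return filtered.table();
   }
   ```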
   
   So, I don't know if this is possible, but if there were a way of managing it, it would be a very nice interface to have in the C++ APIs, and I think it could potentially save a lot of time, as well as add flexibility, for a user-programmer writing data pipelines.
   
   Thanks!
   
   ### Component(s)
   
   C++

