zanmato1984 opened a new issue, #47374: URL: https://github.com/apache/arrow/issues/47374
### Describe the enhancement requested This is the umbrella issue for supporting special form expression in Arrow compute. ### Rationale It all starts from #41094. The comment https://github.com/apache/arrow/issues/41094#issuecomment-2087716483 is the rationale. To summarize a bit, consider expression `if (a != 0) then (b / a) else 0`. In current Arrow expression evaluation, it is a nested function call to `if_else(ne(a, 0), div(b, a), 0)`, which eagerly evaluates all its parameters including the `div(b, a)` even if `a` can be `0`, or more formally, this is a "strict binding" strategy [1]. This is unintuitive and problematic given the meaning of `if_else`. What we expect here is a non-strict binding strategy, "call-by-name" [2]. Most programming languages have exceptional support for a handful "special" expressions , e.g. Lisp calls those "special forms" [3]. And as a closer peer, Velox uses the same term [4]. (Thanks @felipecrv for all these references!) It is especially tricky to support such special forms in a vectorized engine like Arrow. There are two main challenges: * For conditional evaluation like `if cond then if_branch else else_branch`, all the arguments are evaluated on a batch basis. However which branch to evaluate (`if_branch` VS. `else_branch`) is decided by `cond` on a row basis. We need a "selection vector" during evaluation to mask which rows in the batch to evaluate and which not - this is about not only the efficiency, but also the "visible side effect" (consider `divide by zero` thrown in a `divisor != 0` branch). * In addition to the selection vector, the kernels must respect it. Currently none does. And it's almost impossible for us to make all existing kernels to respect the selection vector at once (and we probably never will). Thus we need an incremental way to add selection-vector-aware kernels on demand, meanwhile accommodate legacy (selection-vector-non-aware) kernels to be executed "selection-vector-aware"-ly in a general manner - the idea is to first "gather" selected rows from the batch into a new batch, evaluate the expression on the new batch, then "scatter" the result rows into the positions where it belongs in the original batch. ### Approach I did try to solve this in #42106 last year, however neither the reviewer nor myself were quite satisfied with the solution back then. Since then I've been reworking on this off and on. Until now I'm confident about the new approach. It's a large amount of work, so I split the mission into several separate yet self-consistent concrete tasks to ease the review and be more friendly to potential community help. - [ ] Issue # placeholder: Move `scatter` function into compute core. As mentioned above, the `take` ("gather") and `scatter` functions play an essential role to support selective execution for arbitrary legacy (selection-vector-non-aware) kernels, thus they have to exist in libarrow. Function `take` already does. We need `scatter`. - [ ] Issue # placeholder: Support "selective execution" for kernels. This is IMO the most challenging part. As mentioned above, we need an incremental way to add selection-vector-aware kernels on demand, meanwhile accommodate legacy (selection-vector-non-aware) kernels to be executed "selection-vector-aware"-ly in a general manner. - [ ] Issue # placeholder: Introduce structures of special form into Arrow compute C++, and probably add `if_else` as the first one because it's almost the most wanted one. The evaluation of the special form will generate selection vector on the fly and ingest it into the evaluation of its arguments, leveraging the selective execution for them. - [ ] Add selective kernels to avoid the overhead of "general selective execution" (gather/scatter): e.g. `divide` might be the most common one. - [ ] Add other special forms on demand: other conditional (`case_when`, `coalesce`), boolean logical (logical `and`, `or`). - [ ] Add Python bindings for special forms. - [ ] Add bindings for other implementations: R, Substrait, etc. [1] https://en.wikipedia.org/wiki/Evaluation_strategy#Strict_binding_strategies [2] https://en.wikipedia.org/wiki/Evaluation_strategy#Call_by_name [3] https://courses.cs.northwestern.edu/325/readings/special-forms.html [4] https://facebookincubator.github.io/velox/develop/expression-evaluation.html ### Component(s) C++ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
