zanmato1984 opened a new issue, #47374:
URL: https://github.com/apache/arrow/issues/47374

   ### Describe the enhancement requested
   
   This is the umbrella issue for supporting special form expression in Arrow 
compute.
   
   ### Rationale
   
   It all starts from #41094. The comment 
https://github.com/apache/arrow/issues/41094#issuecomment-2087716483 is the 
rationale.
   
   To summarize a bit, consider expression `if (a != 0) then (b / a) else 0`. 
In current Arrow expression evaluation, it is a nested function call to 
`if_else(ne(a, 0), div(b, a), 0)`, which eagerly evaluates all its parameters 
including the `div(b, a)` even if `a` can be `0`, or more formally, this is a 
"strict binding" strategy [1]. This is unintuitive and problematic given the 
meaning of `if_else`. What we expect here is a non-strict binding strategy, 
"call-by-name" [2]. Most programming languages have exceptional support for a 
handful "special" expressions , e.g. Lisp calls those "special forms" [3]. And 
as a closer peer, Velox uses the same term [4]. (Thanks @felipecrv for all 
these references!)
   
   It is especially tricky to support such special forms in a vectorized engine 
like Arrow. There are two main challenges:
   * For conditional evaluation like `if cond then if_branch else else_branch`, 
all the arguments are evaluated on a batch basis. However which branch to 
evaluate (`if_branch` VS. `else_branch`) is decided by `cond` on a row basis. 
We need a "selection vector" during evaluation to mask which rows in the batch 
to evaluate and which not - this is about not only the efficiency, but also the 
"visible side effect" (consider `divide by zero` thrown in a `divisor != 0` 
branch).
   * In addition to the selection vector, the kernels must respect it. 
Currently none does. And it's almost impossible for us to make all existing 
kernels to respect the selection vector at once (and we probably never will). 
Thus we need an incremental way to add selection-vector-aware kernels on 
demand, meanwhile accommodate legacy (selection-vector-non-aware) kernels to be 
executed "selection-vector-aware"-ly in a general manner - the idea is to first 
"gather" selected rows from the batch into a new batch, evaluate the expression 
on the new batch, then "scatter" the result rows into the positions where it 
belongs in the original batch.
   
   ### Approach
   
   I did try to solve this in #42106 last year, however neither the reviewer 
nor myself were quite satisfied with the solution back then. Since then I've 
been reworking on this off and on. Until now I'm confident about the new 
approach. It's a large amount of work, so I split the mission into several 
separate yet self-consistent concrete tasks to ease the review and be more 
friendly to potential community help.
   
   - [ ] Issue # placeholder: Move `scatter` function into compute core. As 
mentioned above, the `take` ("gather") and `scatter` functions play an 
essential role to support selective execution for arbitrary legacy 
(selection-vector-non-aware) kernels, thus they have to exist in libarrow. 
Function `take` already does. We need `scatter`.
   - [ ] Issue # placeholder: Support "selective execution" for kernels. This 
is IMO the most challenging part. As mentioned above, we need an incremental 
way to add selection-vector-aware kernels on demand, meanwhile accommodate 
legacy (selection-vector-non-aware) kernels to be executed 
"selection-vector-aware"-ly in a general manner.
   - [ ] Issue # placeholder: Introduce structures of special form into Arrow 
compute C++, and probably add `if_else` as the first one because it's almost 
the most wanted one. The evaluation of the special form will generate selection 
vector on the fly and ingest it into the evaluation of its arguments, 
leveraging the selective execution for them.
   - [ ] Add selective kernels to avoid the overhead of "general selective 
execution" (gather/scatter): e.g. `divide` might be the most common one.
   - [ ] Add other special forms on demand: other conditional (`case_when`, 
`coalesce`), boolean logical (logical `and`, `or`).
   - [ ] Add Python bindings for special forms.
   - [ ] Add bindings for other implementations: R, Substrait, etc.
   
   [1] 
https://en.wikipedia.org/wiki/Evaluation_strategy#Strict_binding_strategies
   [2] https://en.wikipedia.org/wiki/Evaluation_strategy#Call_by_name
   [3] https://courses.cs.northwestern.edu/325/readings/special-forms.html
   [4] 
https://facebookincubator.github.io/velox/develop/expression-evaluation.html
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to