gstvg commented on PR #21323:
URL: https://github.com/apache/datafusion/pull/21323#issuecomment-4374867945

   @rluvaton There's the *`LambdaExpr`* in the plan (this is where `ProjectionExprs` could replace the `projection` field):
   
   ```rust
   pub struct LambdaExpr {
       params: Vec<String>,
       body: Arc<dyn PhysicalExpr>,
       projected_body: Arc<dyn PhysicalExpr>,
       projection: Vec<usize>,
   }
   ```
   
   It sits as a child of the higher-order function, as in `array_transform([1,2,3], v -> v+1)`:
   
   ```
    HigherOrderFunction(array_transform)
    ├── args[0]: LiteralExpr([1,2,3])
    └── args[1]: LambdaExpr
        ├── params: ["v"]
        └── body: BinaryExpr(+)
            ├── LambdaVariable("v")
            └── LiteralExpr(1)
    ```
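
   As a minimal illustration of what the `projection` field is for (all types here are simplified stand-ins, not the actual DataFusion API), the idea is that the lambda body is evaluated against only the projected outer columns:

   ```rust
   use std::sync::Arc;

   // Simplified stand-in for DataFusion's PhysicalExpr; a plain `i64`
   // plays the role of a column here.
   trait PhysicalExpr {
       fn evaluate(&self, columns: &[i64]) -> i64;
   }

   /// Reads column `self.0` of whatever input it is given.
   struct ColumnRef(usize);
   impl PhysicalExpr for ColumnRef {
       fn evaluate(&self, columns: &[i64]) -> i64 {
           columns[self.0]
       }
   }

   struct LambdaExpr {
       params: Vec<String>,
       // Body rewritten against the *projected* input.
       projected_body: Arc<dyn PhysicalExpr>,
       // Indices of the outer columns the body actually needs.
       projection: Vec<usize>,
   }

   impl LambdaExpr {
       fn evaluate(&self, outer_columns: &[i64]) -> i64 {
           // Keep only the columns listed in `projection`...
           let projected: Vec<i64> =
               self.projection.iter().map(|&i| outer_columns[i]).collect();
           // ...and evaluate the projected body against them.
           self.projected_body.evaluate(&projected)
       }
   }
   ```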
   
   I now see this comment at `LambdaArgument`, which I think may have added to the confusion.
   
   ```rust
       /// The parameters defined in this lambda
       ///
       /// For example, for `array_transform([2], v -> -v)`,
       /// this will be `vec![Field::new("v", DataType::Int32, true)]`
       params: Vec<FieldRef>,
      ...
    /// Cached schema built from `params`. Reused across every `evaluate` call
    /// (and across every nested-list iteration when the lambda is called once
    /// per outer sublist), avoiding the per-call `Schema::new` build that
    /// includes constructing the internal name -> index map.
       schema: SchemaRef,
   ```
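
   A rough sketch of what that caching buys, with a hand-rolled stand-in for arrow's `Schema` (the real one similarly builds a name -> index map on construction):

   ```rust
   use std::collections::HashMap;
   use std::sync::Arc;

   // Stand-in for arrow's Schema: building it includes constructing the
   // internal name -> index map, which is the per-call cost being avoided.
   struct Schema {
       name_to_index: HashMap<String, usize>,
   }

   impl Schema {
       fn new(param_names: &[String]) -> Self {
           let name_to_index = param_names
               .iter()
               .enumerate()
               .map(|(i, n)| (n.clone(), i))
               .collect();
           Self { name_to_index }
       }

       fn index_of(&self, name: &str) -> Option<usize> {
           self.name_to_index.get(name).copied()
       }
   }

   type SchemaRef = Arc<Schema>;

   struct LambdaArgument {
       // Built once from the params, then shared by every `evaluate` call.
       schema: SchemaRef,
   }

   impl LambdaArgument {
       fn new(param_names: &[String]) -> Self {
           Self { schema: Arc::new(Schema::new(param_names)) }
       }

       fn evaluate(&self) -> SchemaRef {
           // Reuses the cached schema instead of rebuilding it per call.
           Arc::clone(&self.schema)
       }
   }
   ```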
   
   `schema` will only be reused by functions that invoke `evaluate` *multiple times per batch*, as `array_reduce` will, but not by the existing `array_transform` and `array_any_match`, which invoke `evaluate` a single time per batch.
   
   `LambdaArgument` is just the lambda variant at `HigherOrderFunctionArgs.args` (`ValueOrLambda<ColumnarValue, LambdaArgument>`) and holds the information necessary for the function implementer to invoke the lambda body without dealing with implementation details. For example, it includes the captured outer column arrays, since `HigherOrderFunctionArgs` doesn't contain the input batch. If the implementer received only the `LambdaExpr`, it would also need to receive the whole input batch, handle computing only the requested parameters, project the input batch, and merge the result with the computed parameters.
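
   If it helps, this is roughly the shape described above, with placeholder payload types (the real `ColumnarValue` and `LambdaArgument` are of course more involved):

   ```rust
   // Placeholder payloads; the real ones are DataFusion's ColumnarValue and
   // the LambdaArgument described above (params, captured columns, schema, ...).
   struct ColumnarValue(i64);
   struct LambdaArgument {
       captured_columns: Vec<i64>,
   }

   // The variant type held in `HigherOrderFunctionArgs.args`: each argument
   // is either a plain evaluated value or a callable lambda.
   enum ValueOrLambda<V, L> {
       Value(V),
       Lambda(L),
   }

   fn describe(arg: &ValueOrLambda<ColumnarValue, LambdaArgument>) -> &'static str {
       match arg {
           ValueOrLambda::Value(_) => "plain argument",
           ValueOrLambda::Lambda(_) => "lambda argument",
       }
   }
   ```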
   
   Now, to save costs, we can modify `LambdaExpr.params` to store `Vec<FieldRef>` instead of `Vec<String>`, which would save the cost of creating it for every batch (it involves calling `HigherOrderUDF::lambda_parameters`). I already experimented with this at http://github.com/gstvg/arrow-datafusion/commit/3a8ad01ca3ed62d2aad3633131c0041e1b04f810. (It also makes other parts of the code much simpler, but I realized that a bit late for #21679.)
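
   In other words (with a stand-in for arrow's `Field`/`FieldRef`, just to show the shape of the change):

   ```rust
   use std::sync::Arc;

   // Hypothetical stand-in for arrow's Field; FieldRef is an Arc'd Field.
   #[derive(Debug, PartialEq)]
   struct Field {
       name: String,
       data_type: String,
       nullable: bool,
   }
   type FieldRef = Arc<Field>;

   // Before: only names are stored, so the resolved fields (and the schema
   // built from them) must be recreated for every batch.
   struct LambdaExprBefore {
       params: Vec<String>,
   }

   // After: fully resolved fields are stored at construction time, so
   // per-batch evaluation can reuse them directly.
   struct LambdaExprAfter {
       params: Vec<FieldRef>,
   }
   ```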
   
   Now, to construct the `schema` or the `Projector` from [ProjectionExprs::make_projector](https://docs.rs/datafusion/latest/datafusion/physical_expr/projection/struct.ProjectionExprs.html#method.make_projector), which now includes captured outer columns, we need the outer schema. That schema is not always present during physical expr creation and is only guaranteed to be available during evaluation, via the `input_batch`. It would be trivial to store it behind a `OnceLock`, computing it only during the first `evaluate` call and reusing it afterwards, but I see a single usage of `OnceLock` in the hash join implementation in `physical-plan`, and none in `physical-expr`. Is it okay to use it?
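
   For reference, the `OnceLock` approach I have in mind looks roughly like this (std's `OnceLock`; the schema type is a stand-in and the field names are illustrative):

   ```rust
   use std::sync::{Arc, OnceLock};

   // Stand-in for the outer schema that would be derived from the input batch.
   struct Schema {
       field_names: Vec<String>,
   }
   type SchemaRef = Arc<Schema>;

   struct LambdaExpr {
       // Empty until the first `evaluate` call; then filled from the input
       // batch's schema and reused for every subsequent call.
       outer_schema: OnceLock<SchemaRef>,
   }

   impl LambdaExpr {
       fn evaluate(&self, input_batch_fields: &[String]) -> SchemaRef {
           let schema = self.outer_schema.get_or_init(|| {
               // Only runs on the first call; later calls reuse the result.
               Arc::new(Schema { field_names: input_batch_fields.to_vec() })
           });
           Arc::clone(schema)
       }
   }
   ```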


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

