gstvg commented on PR #21323:
URL: https://github.com/apache/datafusion/pull/21323#issuecomment-4374867945
@rluvaton There's the *`LambdaExpr`* in the plan (this is where the
`ProjectionExprs` can replace the `projection` field)
```rust
pub struct LambdaExpr {
params: Vec<String>,
body: Arc<dyn PhysicalExpr>,
projected_body: Arc<dyn PhysicalExpr>,
projection: Vec<usize>,
}
```
It sits as a child of the higher-order function, as in `array_transform([1,2,3],
v -> v+1)`:
```
HigherOrderFunction(array_transform)
├── args[0]: LiteralExpr([1,2,3])
└── args[1]: LambdaExpr
    ├── params: ["v"]
    └── body: BinaryExpr(+)
        ├── LambdaVariable("v")
        └── LiteralExpr(1)
```
I now see this comment at `LambdaArgument`, which I think may have added to the
confusion.
```rust
/// The parameters defined in this lambda
///
/// For example, for `array_transform([2], v -> -v)`,
/// this will be `vec![Field::new("v", DataType::Int32, true)]`
params: Vec<FieldRef>,
...
/// Cached schema built from `params`. Reused across every `evaluate` call
/// (and across every nested-list iteration when the lambda is called once
/// per outer sublist), avoiding the per-call `Schema::new` build that
/// includes constructing the internal name -> index map.
schema: SchemaRef,
```
`schema` will only be reused by functions that invoke `evaluate`
*multiple times per batch*, as `array_reduce` will, but not by the existing
`array_transform` and `array_any_match`, which invoke `evaluate` a single time
per batch.
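To illustrate the point above, here is a minimal sketch of the caching pattern, using a hypothetical simplified `Schema` in place of arrow's (the name -> index map construction stands in for the per-call `Schema::new` cost):

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical stand-in for arrow's Schema: the point is that building the
// name -> index map has a per-call cost worth caching.
struct Schema {
    index: HashMap<String, usize>,
}

impl Schema {
    fn new(fields: &[&str]) -> Self {
        // This map construction is the work the cached `schema` field avoids
        // when `evaluate` runs many times per batch.
        let index = fields
            .iter()
            .enumerate()
            .map(|(i, name)| (name.to_string(), i))
            .collect();
        Self { index }
    }
}

// A lambda that builds its schema once up front, as the `schema: SchemaRef`
// field does.
struct CachedLambda {
    schema: Arc<Schema>,
}

impl CachedLambda {
    fn new(params: &[&str]) -> Self {
        Self { schema: Arc::new(Schema::new(params)) }
    }

    // `evaluate` reuses the cached schema; an `array_reduce`-style function
    // may call this once per outer sublist within a single batch.
    fn evaluate(&self) -> Arc<Schema> {
        Arc::clone(&self.schema)
    }
}

fn main() {
    let lambda = CachedLambda::new(&["acc", "v"]);
    // Many evaluate calls per batch, one schema build.
    let a = lambda.evaluate();
    let b = lambda.evaluate();
    assert!(Arc::ptr_eq(&a, &b));
    assert_eq!(a.index["v"], 1);
    println!("ok");
}
```

For a single-`evaluate`-per-batch function like `array_transform`, the cache is built and used exactly once, so it buys nothing there.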
`LambdaArgument` is just the lambda variant at
`HigherOrderFunctionArgs.args` (`ValueOrLambda<ColumnarValue, LambdaArgument>`)
and holds the information the function implementer needs to invoke the lambda
body without dealing with implementation details. For example, it includes the
captured outer column arrays, since `HigherOrderFunctionArgs` doesn't contain
the input batch. If the implementer received only the `LambdaExpr`, it would
also need to receive the whole input batch, compute only the requested
parameters, project the input batch, and merge it with the computed parameters.
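A minimal sketch of that shape, with simplified stand-ins (`Vec<i64>` for `ColumnarValue`, a hard-coded body for the lambda invocation) for the real PR types:

```rust
use std::vec::Vec;

// Hypothetical stand-in for ColumnarValue.
type ColumnarValue = Vec<i64>;

// What the function implementer sees per argument: either an eagerly
// evaluated value or an invokable lambda.
enum ValueOrLambda<V, L> {
    Value(V),
    Lambda(L),
}

// Carries the captured outer column arrays so the implementer never touches
// the input batch, the projection, or the parameter-merging details.
struct LambdaArgument {
    captured: Vec<ColumnarValue>,
}

impl LambdaArgument {
    // Invoke the lambda body on the element arrays; merging the captured
    // outer columns with the parameters happens internally (elided here,
    // the body is a hard-coded `v + 1`).
    fn invoke(&self, elements: &ColumnarValue) -> ColumnarValue {
        let _ = &self.captured; // would be merged with `elements` internally
        elements.iter().map(|v| v + 1).collect()
    }
}

fn main() {
    let arg: ValueOrLambda<ColumnarValue, LambdaArgument> =
        ValueOrLambda::Lambda(LambdaArgument { captured: vec![] });
    if let ValueOrLambda::Lambda(lambda) = &arg {
        assert_eq!(lambda.invoke(&vec![1, 2, 3]), vec![2, 3, 4]);
    }
    println!("ok");
}
```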
Now, to save costs, we can modify `LambdaExpr.params` to store
`Vec<FieldRef>` instead of `Vec<String>`, which would save the cost of creating
it for every batch (it involves calling `HigherOrderUDF::lambda_parameters`). I
already experimented with this at
http://github.com/gstvg/arrow-datafusion/commit/3a8ad01ca3ed62d2aad3633131c0041e1b04f810.
(It also makes other parts of the code much simpler, but I realized that a bit
late for #21679)
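The saving can be sketched like this, with a hypothetical simplified `Field` standing in for arrow's (and the `params_per_batch` path standing in for the per-batch `HigherOrderUDF::lambda_parameters` call):

```rust
use std::sync::Arc;

// Hypothetical stand-in for arrow's Field; constructing one per parameter
// per batch is the cost the Vec<FieldRef> change avoids.
struct Field {
    name: String,
}

type FieldRef = Arc<Field>;

// Before: params stored as names, fields rebuilt for every batch.
fn params_per_batch(names: &[String]) -> Vec<FieldRef> {
    names
        .iter()
        .map(|n| Arc::new(Field { name: n.clone() }))
        .collect()
}

// After: params stored as Vec<FieldRef> at plan time; each batch only
// clones the Arcs.
fn params_cached(params: &[FieldRef]) -> Vec<FieldRef> {
    params.to_vec()
}

fn main() {
    let stored: Vec<FieldRef> = vec![Arc::new(Field { name: "v".into() })];
    // Same allocation is reused across batches.
    let per_batch = params_cached(&stored);
    assert!(Arc::ptr_eq(&stored[0], &per_batch[0]));
    // The per-name path allocates a fresh Field each time.
    let rebuilt = params_per_batch(&["v".to_string()]);
    assert!(!Arc::ptr_eq(&stored[0], &rebuilt[0]));
    assert_eq!(rebuilt[0].name, "v");
    println!("ok");
}
```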
Now, to construct the `schema` or the `Projector` from
[ProjectionExprs::make_projector](https://docs.rs/datafusion/latest/datafusion/physical_expr/projection/struct.ProjectionExprs.html#method.make_projector),
which now includes captured outer columns, we need the outer schema, which is
not always present during physical expr creation and is only guaranteed to be
available during evaluation, via the `input_batch`. It's trivial to put it
behind a `OnceLock`, computing it only during the first `evaluate` call and
reusing it afterwards, but I see a single usage of `OnceLock` in the hash join
implementation in `physical-plan` and none in `physical-expr`. Is it okay to
use it?
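The `OnceLock` approach I have in mind is roughly this, with a hypothetical simplified expr (a `Vec<String>` of field names standing in for the real `SchemaRef` built from the outer schema):

```rust
use std::sync::{Arc, OnceLock};

// Hypothetical simplified expr; the real one would cache a SchemaRef built
// from the outer schema, which is only reliably known from the input batch.
struct LambdaExpr {
    schema: OnceLock<Arc<Vec<String>>>,
}

impl LambdaExpr {
    fn new() -> Self {
        Self { schema: OnceLock::new() }
    }

    // The first evaluate call builds the schema from the batch's fields;
    // later calls reuse the cached value.
    fn evaluate(&self, batch_fields: &[&str]) -> Arc<Vec<String>> {
        let schema = self.schema.get_or_init(|| {
            Arc::new(batch_fields.iter().map(|s| s.to_string()).collect())
        });
        Arc::clone(schema)
    }
}

fn main() {
    let expr = LambdaExpr::new();
    let a = expr.evaluate(&["a", "b"]);
    let b = expr.evaluate(&["a", "b"]);
    // Built once, shared afterwards.
    assert!(Arc::ptr_eq(&a, &b));
    println!("ok");
}
```

`OnceLock::get_or_init` blocks concurrent callers until the first initialization finishes, so it stays correct if the expr is ever evaluated from multiple threads.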
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]