gstvg commented on PR #21679:
URL: https://github.com/apache/datafusion/pull/21679#issuecomment-4286551740

   > ... requiring the user to provide a valid typed value, same as other engines.
   
   @LiaCastaneda  I thought they didn't require the exact type, but after checking Spark and ClickHouse, they do require it, thanks for the info. So, answering the initial question of @rluvaton and @pepijnve: the current design does allow implementing the Spark-style reduce by using the initial_value type as both the merge accumulator type and the finish lambda's only argument type, as Spark itself does:
   
   [checking initial_value/zero and merge output to match](https://github.com/apache/spark/blob/96581a5b9611d28518eb22c66add581b9724a899/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala#L840-L841)
   [using initial_value/zero as accumulator type](https://github.com/apache/spark/blob/96581a5b9611d28518eb22c66add581b9724a899/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala#L861-L863)
   [test with explicit casts (the only test with arrays that I found)](https://github.com/apache/spark/blob/fd472f1ca39113234cbf3f647ebb0ecb6d0bee74/sql/core/src/test/scala/org/apache/spark/sql/DataFrameComplexTypeSuite.scala#L198)
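   To make the linked Spark behavior concrete, here is a minimal hedged sketch of that check: the merge lambda's return type must equal the initial value (`zero`) type, otherwise the user has to insert an explicit cast. The `DataType` enum and `check_merge_output` name are illustrative stand-ins, not Spark's or DataFusion's actual API.

    ```rust
    // Hypothetical sketch of the Spark-style rule: merge output type must
    // match the initial value (zero) type, or an explicit cast is required.
    #[derive(Debug, Clone, PartialEq)]
    enum DataType {
        Int32,
        Int64,
    }

    fn check_merge_output(zero: &DataType, merge_output: &DataType) -> Result<(), String> {
        if zero == merge_output {
            Ok(())
        } else {
            Err(format!(
                "merge lambda returns {merge_output:?} but the initial value is {zero:?}; \
                 an explicit cast is required"
            ))
        }
    }

    fn main() {
        // same types: accepted
        assert!(check_merge_output(&DataType::Int64, &DataType::Int64).is_ok());
        // mismatched types: rejected, mirroring Spark's analysis-time error
        assert!(check_merge_output(&DataType::Int32, &DataType::Int64).is_err());
    }
    ```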
   
   But it would be nicer not to require explicit casts; [snowflake](https://docs.snowflake.com/en/sql-reference/functions/reduce#use-an-array-for-the-accumulator) and the [next release of duckdb](https://www.github.com/duckdb/duckdb/pull/21032/changes#diff-e037d097f599d5e8b7499ad31c4325156fef4f5327760310907140f6658ca1caR78-R82) will support that, so I agree that we should at least plan to support it without requiring breaking changes in the future.
   
   My WIP branch is currently using this:
   
   <details>
   
   ```rust
   enum LambdaParametersProgress {
        // the parameters of some lambdas are still unknown
       Partial(Vec<Option<Vec<FieldRef>>>),
        // the parameters of all lambdas are known
       Complete(Vec<Vec<FieldRef>>),
   }
   
   trait HigherOrderUDF {
       fn lambda_parameters(
           &self,
           step: usize,
            // Some = the return field of a lambda that could be computed
            // because its parameters have been returned in some previous call
            // None = the previous call didn't return the parameters of the lambda
            fields: &[ValueOrLambda<FieldRef, Option<FieldRef>>],
       ) -> Result<LambdaParametersProgress>;
   }
   
   impl HigherOrderUDF for ArrayReduce {
       fn lambda_parameters(
           &self,
           step: usize,
           fields: &[ValueOrLambda<FieldRef, Option<FieldRef>>],
       ) -> Result<LambdaParametersProgress> {
           // optional finish not handled for simplicity
           let [
               ValueOrLambda::Value(list),
               ValueOrLambda::Value(initial_value),
               ValueOrLambda::Lambda(merge),
               ValueOrLambda::Lambda(_finish),
           ] = fields
           else {
                return plan_err!(
                    "reduce expects a list value, then an initial value, then a merge lambda and finally a finish lambda"
                );
           };
   
           let list_field = match list.data_type() {
               // currently only fsl supported
               DataType::FixedSizeList(field, _) => field,
                _ => return plan_err!("reduce expects a list as its first argument"),
           };
   
           Ok(match (step, merge) {
               (0, _) => {
                    // at the first step, we use the initial_value as the merge
                    // accumulator, and return None for finish since we don't
                    // know the output of merge
                   LambdaParametersProgress::Partial(vec![
                       // merge
                        Some(vec![Arc::clone(initial_value), Arc::clone(list_field)]),
                       // finish
                       None,
                   ])
               }
               (1, Some(accumulator)) => {
                    // now we can use the merge output as its accumulator and
                   // as the finish parameter
                   LambdaParametersProgress::Complete(vec![
                       // merge
                       vec![Arc::clone(accumulator), Arc::clone(list_field)],
                       // finish
                       vec![Arc::clone(accumulator)],
                   ])
               }
               (1, None) => {
                    return plan_err!("merge should be resolved at reduce step 1 (0-based)");
               }
               _ => todo!(),
           })
       }
   }
   ```
   
   </details>
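   For context, here is a minimal, self-contained sketch of how a planner could drive this stepwise `lambda_parameters` API to a fixed point. `Field`, `ValueOrLambda`, `Progress`, and the toy `Reduce` are simplified stand-ins (not DataFusion's real `FieldRef`/`ValueOrLambda` types, and the list argument is omitted); only the control flow is the point.

    ```rust
    // Stand-ins for the real types; only the control flow matters here.
    #[derive(Clone, Debug, PartialEq)]
    struct Field(String);

    enum ValueOrLambda {
        Value(Field),
        // Some(return field) once a previous step resolved this lambda
        Lambda(Option<Field>),
    }

    enum Progress {
        // the parameters of some lambdas are still unknown
        Partial(Vec<Option<Vec<Field>>>),
        // the parameters of all lambdas are known
        Complete(Vec<Vec<Field>>),
    }

    // Toy reduce-like UDF with args [initial value, merge, finish].
    struct Reduce;

    impl Reduce {
        fn lambda_parameters(&self, step: usize, args: &[ValueOrLambda]) -> Progress {
            let ValueOrLambda::Value(init) = &args[0] else {
                panic!("first argument must be a value")
            };
            match (step, &args[1]) {
                // step 0: merge's accumulator is typed from the initial value;
                // finish stays unresolved because merge's output isn't known yet
                (0, _) => Progress::Partial(vec![Some(vec![init.clone()]), None]),
                // later steps: merge's return field is known, so finish can be typed
                (_, ValueOrLambda::Lambda(Some(merge_out))) => Progress::Complete(vec![
                    vec![merge_out.clone()],
                    vec![merge_out.clone()],
                ]),
                _ => panic!("merge should be resolved at step 1"),
            }
        }
    }

    fn main() {
        let udf = Reduce;
        let mut args = vec![
            ValueOrLambda::Value(Field("Int64".into())),
            ValueOrLambda::Lambda(None), // merge: return field unknown
            ValueOrLambda::Lambda(None), // finish: return field unknown
        ];
        for step in 0.. {
            match udf.lambda_parameters(step, &args) {
                Progress::Partial(params) => {
                    // "type-check" each lambda whose parameters are now known and
                    // feed its return field back for the next step; here we just
                    // pretend merge returns its accumulator's type
                    if params[0].is_some() {
                        args[1] = ValueOrLambda::Lambda(Some(Field("Int64".into())));
                    }
                }
                Progress::Complete(all) => {
                    assert_eq!(all, vec![
                        vec![Field("Int64".into())],
                        vec![Field("Int64".into())],
                    ]);
                    break;
                }
            }
        }
    }
    ```

   With the real API the loop would sit in the planner, re-invoking `lambda_parameters` with the lambdas' freshly inferred return fields until `Complete` is reached.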
   
   I hope to push this tomorrow to another branch, with CI running, as a minimal implementation of array_reduce.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to