timsaucer opened a new pull request, #1545:
URL: https://github.com/apache/datafusion-python/pull/1545

   # Which issue does this PR close?
   
   No associated issue. **PR 2 of 4** stacked on 
[#1544](https://github.com/apache/datafusion-python/pull/1544). The diff shown 
against \`main\` is cumulative until #1544 merges — review the commits on 
\`pr2-agg-window-inline\` directly, or wait for #1544 to merge for a clean diff.
   
   # Rationale for this change
   
   PR 1 closed the round-trip for scalar UDFs. The same shipped-expression 
problem applies to Python aggregate and window UDFs: their accumulator / 
partition-evaluator factory is a Python callable, so a receiver that only has 
the UDF *name* cannot reconstruct one. This PR extends the inline-encoding 
mechanism so the natural \`pickle.dumps(expr)\` pattern also works for 
expressions referencing Python UDAFs and UDWFs.
   
   # What changes are included in this PR?
   
   The codec extension directly parallels the scalar path. It adds two new 
wire-format families:
   
   | Kind   | Family magic | Cloudpickle tuple shape |
   |--------|--------------|-------------------------|
   | Agg    | \`DFPYUDA\`  | \`(name, accumulator_factory, input_schema_bytes, return_schema_bytes, state_schema_bytes, volatility_str)\` |
   | Window | \`DFPYUDW\`  | \`(name, evaluator_factory, input_schema_bytes, return_schema_bytes, volatility_str)\` |
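
As a rough illustration of the table above, the sketch below frames a pickled tuple with a family magic prefix and dispatches on that prefix when decoding. It is a sketch under stated assumptions, not the PR's codec: it uses the standard-library `pickle` as a stand-in for cloudpickle, it assumes the payload is simply the magic followed by the pickled tuple (the real layout may carry more framing), and the names `encode_udf`, `decode_udf`, and `SumAccumulator` are hypothetical. The schema entries are placeholder bytes rather than real IPC schemas.

```python
import pickle

# Family magics from the table above; everything else here is illustrative.
AGG_MAGIC = b"DFPYUDA"
WINDOW_MAGIC = b"DFPYUDW"


class SumAccumulator:
    """Hypothetical accumulator; the class itself acts as the factory."""

    def __init__(self):
        self.total = 0.0

    def update(self, values):
        self.total += sum(values)

    def evaluate(self):
        return self.total


def encode_udf(magic, parts):
    # Assumed layout: magic prefix followed directly by the pickled tuple.
    return magic + pickle.dumps(parts)


def decode_udf(payload):
    # Dispatch on the family magic, then unpickle the remaining bytes.
    for magic, kind in ((AGG_MAGIC, "agg"), (WINDOW_MAGIC, "window")):
        if payload.startswith(magic):
            return kind, pickle.loads(payload[len(magic):])
    raise ValueError("unrecognized UDF family magic")


agg_parts = (
    "my_sum",            # name
    SumAccumulator,      # accumulator_factory
    b"<input schema>",   # input_schema_bytes (placeholder)
    b"<return schema>",  # return_schema_bytes (placeholder)
    b"<state schema>",   # state_schema_bytes (placeholder)
    "immutable",         # volatility_str
)

kind, decoded = decode_udf(encode_udf(AGG_MAGIC, agg_parts))
accumulator = decoded[1]()  # the receiver rebuilds an accumulator from the factory
accumulator.update([1.0, 2.0, 3.0])
```

Note that standard-library `pickle` stores the class by reference, so this sketch only round-trips where `SumAccumulator` is importable. Cloudpickle can serialize such callables by value, which is what lets a receiver that lacks the definition reconstruct the factory.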
   
   The aggregate state schema is encoded as a full IPC schema (not a positional 
\`Vec<DataType>\`), so the post-decode UDF reports the same names, nullability, 
and metadata as the sender. This matters for accumulators whose 
\`StateFieldsArgs\` consumers key off names rather than positions.
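
To make the difference concrete, here is a plain-Python sketch of what a positional type list drops compared with shipping full fields. `StateField` is a hypothetical stand-in for an Arrow field, not a real Arrow or DataFusion type, and the `state_{i}` naming on the positional path is an assumption modeled on the synthesized names this PR describes for the Python constructor path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateField:
    """Hypothetical stand-in for one field of an accumulator's state schema."""

    name: str
    data_type: str
    nullable: bool = True
    metadata: tuple = ()


def positional_round_trip(fields):
    # Ship only the types; the receiver has to synthesize names and
    # falls back to default nullability and empty metadata.
    types = [f.data_type for f in fields]
    return [StateField(f"state_{i}", t) for i, t in enumerate(types)]


def full_schema_round_trip(fields):
    # Ship every attribute, so the receiver rebuilds identical fields.
    wire = [(f.name, f.data_type, f.nullable, f.metadata) for f in fields]
    return [StateField(*row) for row in wire]


sender = [
    StateField("sum", "float64", nullable=False),
    StateField("count", "uint64", nullable=False, metadata=(("unit", "rows"),)),
]

lossy = positional_round_trip(sender)
faithful = full_schema_round_trip(sender)

# A consumer that looks state up by name only works with the faithful copy.
by_name = {f.name: f for f in faithful}
```

With the positional path, a lookup such as `by_name["count"]` would have nothing to find, because the receiver only sees `state_0` and `state_1`.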
   
   To let the codec downcast and grab the Python callable directly, two 
existing UDF impls are restructured:
   
   - \`udaf.rs\`: introduces a named \`PythonFunctionAggregateUDF\` that stores 
the \`Py<PyAny>\` accumulator factory. \`PyAggregateUDF.__new__\` now wires 
\`AggregateUDF::new_from_impl(PythonFunctionAggregateUDF::new(...))\` instead 
of the prior \`create_udaf\` + closure path. State field names default to 
synthesized \`state_{i}\` on the Python constructor path; \`from_parts\` 
(called by the decoder) restores the full schema from the IPC payload.
   - \`udwf.rs\`: renames \`MultiColumnWindowUDF\` → 
\`PythonFunctionWindowUDF\` and drops the \`PartitionEvaluatorFactory\` + 
\`PtrEq\` wrapping. Stores the \`Py<PyAny>\` evaluator directly. \`PartialEq\` 
/ \`Hash\` pick up the same pointer-identity fast path and \`__eq__\` 
exception-logging behavior as \`PythonFunctionScalarUDF\` in PR 1.
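
The equality behavior described for \`PythonFunctionWindowUDF\` can be sketched in Python as follows. This is an illustration only: `CallableKey` and `Explosive` are hypothetical names, and two details are assumptions rather than facts from the PR, namely that a raising `__eq__` is treated as "not equal" after logging and that the hash depends on the name alone.

```python
import logging

logger = logging.getLogger(__name__)


class CallableKey:
    """Hypothetical wrapper mirroring the described PartialEq / Hash behavior."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    def __eq__(self, other):
        if not isinstance(other, CallableKey):
            return NotImplemented
        if self.name != other.name:
            return False
        # Fast path: the very same Python object is trivially equal.
        if self.func is other.func:
            return True
        try:
            return bool(self.func == other.func)
        except Exception:
            # Assumption: a raising __eq__ is logged and treated as "not equal".
            logger.warning("__eq__ raised while comparing callables", exc_info=True)
            return False

    def __hash__(self):
        # Assumption: hash the name only, so equal wrappers always hash alike.
        return hash(self.name)


class Explosive:
    """Callable-like object whose __eq__ always raises."""

    def __eq__(self, other):
        raise RuntimeError("boom")


shared = Explosive()
same_object = CallableKey("f", shared) == CallableKey("f", shared)
different_objects = CallableKey("f", Explosive()) == CallableKey("f", Explosive())
```

The identity check runs before any user-defined `__eq__`, so comparing a UDF with itself never executes Python comparison code that might raise.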
   
   User-facing surface:
   
   - \`AggregateUDF.name\` and \`WindowUDF.name\` properties (parallel to 
\`ScalarUDF.name\` from PR 1).
   - Existing UDAF/UDWF construction paths are unchanged at the user level — 
same constructors, same arguments, same semantics. The internal impl swap is 
invisible.
   
   # Are there any user-facing changes?
   
   - Python aggregate and window UDFs survive \`pickle.dumps\` / 
\`pickle.loads\` and \`Expr.to_bytes\` / \`Expr.from_bytes\` round-trips. The 
decoded UDF reproduces the original state schema and runs end-to-end (verified 
by \`test_agg_udf_evaluates_after_roundtrip\`, which aggregates over a 5-row 
frame after a pickle round-trip).
   - \`AggregateUDF.name\` and \`WindowUDF.name\` are new public properties.
   - \`MultiColumnWindowUDF\` is renamed to \`PythonFunctionWindowUDF\`. The 
struct was \`pub\` but only used within the crate; no in-tree caller breaks. 
Downstream Rust users importing it directly would need to update.
   
   The \`MultiColumnWindowUDF\` rename is a Rust-side breaking change, so I am 
adding the \`api change\` label even though no Python-facing API breaks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

