timsaucer opened a new pull request, #1545: URL: https://github.com/apache/datafusion-python/pull/1545
# Which issue does this PR close?

No associated issue.

**PR 2 of 4** stacked on [#1544](https://github.com/apache/datafusion-python/pull/1544). The diff shown against `main` is cumulative until #1544 merges; review the commits on `pr2-agg-window-inline` directly, or wait for #1544 to merge for a clean diff.

# Rationale for this change

PR 1 closed the round-trip for scalar UDFs. The same shipped-expression problem applies to Python aggregate and window UDFs: their accumulator / partition-evaluator factory is a Python callable, so a receiver that only has the UDF *name* cannot reconstruct one. This PR extends the inline-encoding mechanism so the natural `pickle.dumps(expr)` pattern also works for expressions referencing Python UDAFs and UDWFs.

# What changes are included in this PR?

Codec extension is a straight parallel of the scalar path. New wire-format families:

| Kind   | Family magic | Cloudpickle tuple shape |
|--------|--------------|-------------------------|
| Agg    | `DFPYUDA`    | `(name, accumulator_factory, input_schema_bytes, return_schema_bytes, state_schema_bytes, volatility_str)` |
| Window | `DFPYUDW`    | `(name, evaluator_factory, input_schema_bytes, return_schema_bytes, volatility_str)` |

The aggregate state schema is encoded as a full IPC schema (not a positional `Vec<DataType>`), so the post-decode UDF reports the same names, nullability, and metadata as the sender. This matters for accumulators whose `StateFieldsArgs` consumers key off names rather than positions.

To let the codec downcast and grab the Python callable directly, two existing UDF impls are restructured:

- `udaf.rs`: introduces a named `PythonFunctionAggregateUDF` that stores the `Py<PyAny>` accumulator factory. `PyAggregateUDF.__new__` now wires `AggregateUDF::new_from_impl(PythonFunctionAggregateUDF::new(...))` instead of the prior `create_udaf` + closure path. State field names default to synthesized `state_{i}` on the Python constructor path; `from_parts` (called by the decoder) restores the full schema from the IPC payload.
- `udwf.rs`: renames `MultiColumnWindowUDF` → `PythonFunctionWindowUDF` and drops the `PartitionEvaluatorFactory` + `PtrEq` wrapping. Stores the `Py<PyAny>` evaluator directly. `PartialEq` / `Hash` pick up the same pointer-identity fast path and `__eq__` exception-logging behavior as `PythonFunctionScalarUDF` in PR 1.

User-facing surface:

- `AggregateUDF.name` and `WindowUDF.name` properties (parallel to `ScalarUDF.name` from PR 1).
- Existing UDAF/UDWF construction paths are unchanged at the user level: same constructors, same arguments, same semantics. The internal impl swap is invisible.

# Are there any user-facing changes?

- Python aggregate and window UDFs survive `pickle.dumps` / `pickle.loads` and `Expr.to_bytes` / `Expr.from_bytes` round-trips. The decoded UDF reproduces the original state schema and runs end-to-end (verified by `test_agg_udf_evaluates_after_roundtrip`, which aggregates over a 5-row frame after a pickle round-trip).
- `AggregateUDF.name` and `WindowUDF.name` are new public properties.
- `MultiColumnWindowUDF` is renamed to `PythonFunctionWindowUDF`. The struct was `pub` but only used within the crate; no in-tree caller breaks. Downstream Rust users importing it directly would need to update.

The `MultiColumnWindowUDF` rename is a Rust-side breaking change, so adding `api change` even though no Python-facing API breaks.
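For reviewers who want a concrete picture of the two wire-format families, here is a minimal sketch of magic-prefix framing and dispatch. It is not the codec in this PR. It assumes the family magic is a plain byte prefix followed directly by the pickled tuple (the real layout may carry more, such as version bytes), it uses stdlib `pickle` in place of cloudpickle, and it uses `collections.Counter` as a hypothetical stand-in for an accumulator factory. The helper names `encode_inline` and `decode_inline`, the placeholder schema bytes, and the `"immutable"` volatility string are all illustrative.

```python
import pickle
from collections import Counter
from typing import Any, Tuple

# Family magics as listed in the PR's table.
AGG_MAGIC = b"DFPYUDA"
WINDOW_MAGIC = b"DFPYUDW"


def encode_inline(magic: bytes, parts: Tuple[Any, ...]) -> bytes:
    """ASSUMED framing: family magic followed directly by the pickled tuple."""
    return magic + pickle.dumps(parts)


def decode_inline(buf: bytes) -> Tuple[str, Tuple[Any, ...]]:
    """Dispatch on the magic prefix, then unpickle the remainder."""
    for kind, magic in (("agg", AGG_MAGIC), ("window", WINDOW_MAGIC)):
        if buf.startswith(magic):
            return kind, pickle.loads(buf[len(magic):])
    raise ValueError("not an inline Python UDAF/UDWF payload")


# Aggregate tuple shape from the table. The schema entries are opaque
# placeholders here; in the PR they are Arrow IPC schema bytes.
agg_parts = (
    "my_agg",            # name
    Counter,             # stand-in for accumulator_factory (hypothetical)
    b"<input schema>",   # input_schema_bytes
    b"<return schema>",  # return_schema_bytes
    b"<state schema>",   # state_schema_bytes
    "immutable",         # volatility_str (illustrative value)
)

wire = encode_inline(AGG_MAGIC, agg_parts)
kind, decoded = decode_inline(wire)
```

Stdlib `pickle` stores the factory by reference, so the receiver must be able to import it. Cloudpickle, which the table names, can serialize lambdas and locally defined classes by value, so the receiver does not need the sender's module.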
