timsaucer opened a new pull request, #1546: URL: https://github.com/apache/datafusion-python/pull/1546
# Which issue does this PR close? No associated issue. **PR 3 of 4** stacked on [#1545](https://github.com/apache/datafusion-python/pull/1545) (which is itself stacked on [#1544](https://github.com/apache/datafusion-python/pull/1544)). The diff against \`main\` is cumulative until the prior PRs merge — review the commits on \`pr3-toggle-sender-strict\` directly for the PR3 delta. # Rationale for this change PRs 1 and 2 ship Python UDFs inline through the codec. Two follow-on needs: 1. **Cross-language wire bytes.** A producer that wants its serialized expression to round-trip through a non-Python decoder (e.g. a Rust-only datafusion-distributed worker) needs UDFs to travel by name, not as a cloudpickle blob. 2. **Untrusted-input decoding.** A receiver that may read \`Expr.from_bytes\` input from an untrusted source must refuse to invoke \`cloudpickle.loads\` on the inline payload. (\`pickle.loads\` on untrusted input is still unsafe regardless of this toggle — see the security note in the docstrings.) Both needs are served by the same on/off switch at the codec level. The codec already sits on every session, so the toggle is naturally per-session. # What changes are included in this PR? **Codec layer.** \`PythonLogicalCodec\` and \`PythonPhysicalCodec\` gain a \`python_udf_inlining: bool\` (default \`true\`) plus a \`with_python_udf_inlining(enabled)\` builder. Encode paths short-circuit to \`inner\` when the toggle is off (UDFs travel by name); decode paths return an \`Execution\` error instead of invoking \`cloudpickle.loads\` if they recognize a \`DFPY*\` family magic on a strict codec. The refusal message names both the UDF and the wire family so an operator can immediately see whether to re-encode the bytes upstream or register the UDF on the receiver. **Session layer.** \`PySessionContext::with_python_udf_inlining(enabled)\` returns a new session whose stacked logical + physical codecs both carry the toggle. The \`Arc<SessionState>\` is shared (cheap clone), only the codec pair is rebuilt, so registrations and config stay attached. \`SessionContext.with_python_udf_inlining(*, enabled)\` is the Python wrapper. \`enabled\` is keyword-only because positional booleans at the call site read as opaque. **Sender-side context.** \`datafusion.ipc\` gains \`set_sender_ctx\` / \`get_sender_ctx\` / \`clear_sender_ctx\` thread-locals. \`Expr.__reduce__\` now consults \`get_sender_ctx()\` to pick the codec for outbound pickles — without that hook, \`pickle.dumps\` always invokes \`Expr.to_bytes()\` with no context, so a strict session would never affect the pickle path. \`Expr.to_bytes(ctx)\` calls with an explicit \`ctx\` are unaffected. **Tests.** \`test_pickle_expr.py\` picks up: - \`TestPythonUdfInliningToggle\` — round-trips through a strict session, asserts the strict-side refusal error, exercises the explicit-ctx fast path, and covers an off-then-on toggle to ensure the field is not sticky. - \`TestWorkerCtxLifecycle\` and \`TestSenderCtxLifecycle\` — set/clear/threading semantics for the two thread-locals. \`test_pickle_multiprocessing.py\` (new) plus \`_pickle_multiprocessing_helpers.py\` (new) exercise the full driver → worker round-trip on a \`multiprocessing.Pool\` with \`set_worker_ctx\` installed in the initializer and \`set_sender_ctx\` on the driver. **CI.** \`.github/workflows/test.yml\` gets a 30-minute \`timeout-minutes\` backstop so a hung pickle worker (e.g. during a future regression) cannot block the matrix indefinitely. # Are there any user-facing changes? Yes: - \`SessionContext.with_python_udf_inlining(*, enabled: bool)\` is a new public method. Use \`enabled=False\` to either (a) produce cross-language wire bytes or (b) refuse to deserialize inline Python payloads from untrusted bytes. - \`datafusion.ipc.set_sender_ctx\` / \`get_sender_ctx\` / \`clear_sender_ctx\` are new public functions for propagating a configured session through \`pickle.dumps\`. No breaking changes — the toggle defaults to \`true\`, matching pre-PR behavior. \`api change\` label not added. The user-guide page documenting the full pattern (and the multiprocessing / Ray runnable examples) lands in PR 4 of this series. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
