paultmathew opened a new issue, #50031: URL: https://github.com/apache/arrow/issues/50031
### Describe the enhancement requested

`pyarrow.compute.Expression` has no Python-accessible way to enumerate the fields it references. The C++ side already exposes the underlying primitive ([`arrow::compute::FieldsInExpression`](https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/expression.h#L174)), but the Python `Expression` class only surfaces `cast`, `equals`, `is_null`, `is_nan`, `is_valid`, `isin`, and Substrait round-trip.

Every downstream tool that needs the column set of a predicate today either:

1. Regex-parses `str(expression)` (fragile: quoted string literals and keywords like `and` leak into the result).
2. Serializes to Substrait via `to_substrait(schema)` and walks the protobuf (heavy: requires a bound schema and a substrait dependency just to ask "which columns?").
3. Maintains a parallel AST upstream of `pc.Expression`, like [Ray Data's `_PyArrowExpressionVisitor`](https://docs.ray.io/en/releases-2.54.1/_modules/ray/data/expressions.html).

Exposing the existing C++ primitive removes all three workarounds.

### Motivating use cases

The recurring shape is: a library or end user has a `pc.Expression` in hand and needs to decide **which columns to read off disk** before evaluating it.

1. **Column projection on cold storage.** Wrapping `pyarrow.dataset.Scanner` or `pyiceberg.Table.scan(...)` with a user-supplied filter: the wrapper wants to set `selected_fields = user_projection ∪ filter_refs` to avoid pulling unused columns off S3 / disk.
2. **Conditional MERGE / upsert on Iceberg.** PyIceberg's `Table.upsert` currently has no `when_matched_condition` parameter ([apache/iceberg-python#1534](https://github.com/apache/iceberg-python/pull/1534) explicitly scoped to "when matched update all / when not matched insert all" and directed users to Spark for predicate-based MERGE).
   Implementing a conditional upsert in Python requires projecting only the destination columns the predicate touches before joining and filtering, which needs field-ref introspection.
3. **Predicate splitting across two sources.** Any library that accepts a single user-facing predicate and routes it across a join (source ↔ target, stream ↔ table, etc.) needs to bucket field references by side.
4. **Ray Data, delta-rs, Lance.** Cross-engine routers that translate `pc.Expression` to a non-Arrow execution engine all start with the same question (which fields does it touch?) to decide which engine knows about which columns and which side of a join to push the filter on.

### Prior discussion

Comment thread on the closed [#27160 [Python] Allow to create field reference to nested field](https://github.com/apache/arrow/issues/27160) records this as a known gap that was never tracked:

> bkietz:
> "currently field_refs can only extract a field from the scanned dataset.
> It'd be helpful if they could also extract a field from an Expression."
>
> nealrichardson:
> "Agree that it would be helpful (possibly necessary) to be able to extract
> a field from an Expression more generally."

That thread closed on the inverse direction (constructing nested refs); this issue tracks the missing direction.

### Proposed API

```python
def field_refs(self) -> list[str | int | tuple[str | int, ...]]:
    """
    Return the field references contained in this expression.

    Each reference is reported once per call site (matches the C++
    `FieldsInExpression` semantics). The returned value shape mirrors
    `pyarrow.compute.field()`'s input: by-name references come back as
    `str`, by-index as `int`, and nested references as `tuple`.
    """
```

Round-trip example:

```python
>>> import pyarrow.compute as pc
>>> ((pc.field("a") > 0) & pc.field("b").is_null()).field_refs()
['a', 'b']
>>> pc.field("user", "city").field_refs()
[('user', 'city')]
>>> pc.scalar(5).field_refs()
[]
```

### Open API decisions to settle before implementation

| Decision | Proposed | Rationale |
|---|---|---|
| Method name | `field_refs()` | Mirrors C++ free function `FieldsInExpression` and the existing singular accessor `Expression.field_ref()`. Alternatives: `references()`, `referenced_fields()`. |
| Return type | `list[str \| int \| tuple]` | Round-trip compatible with `pc.field(*ref)`. Avoids introducing a new public `FieldRef` Python type (which would deserve its own design discussion, likely a follow-up). |
| Dedup | No | Matches C++ `FieldsInExpression`. Callers do `set(...)` if desired. |
| Order | Traversal (left-to-right, depth-first) | Documented as "not part of the public contract" to leave room. |
| Single-element FieldPath | Plain `int`, not `(int,)` | Symmetric with `pc.field(3)` returning a non-nested ref. |

Happy to defer any of these to maintainer preference.

### Implementation outline

Small (~80 lines including tests). Three files touched:

- `python/pyarrow/includes/libarrow.pxd`: declare `FieldsInExpression` and the additional `CFieldRef` accessors (`IsName`, `IsFieldPath`, `IsNested`, `field_path`, `nested_refs`) needed for the conversion helper.
- `python/pyarrow/_compute.pyx`: add a `_fieldref_to_python` helper and a `field_refs()` method on `Expression`. Both small.
- `python/pyarrow/tests/test_compute.py`: coverage for the four FieldRef shapes (name / index / nested name / nested index), empty (constant expression), and round-trip through `pc.field()`.

Plus one autosummary line in `docs/source/python/api/compute.rst`.

I'm happy to put up a PR once the API is agreed.
### Related issues

- [#27160](https://github.com/apache/arrow/issues/27160): closed; this issue captures the unfiled follow-up.
- [#34433](https://github.com/apache/arrow/issues/34433): adjacent; asks for `table.evaluate(expr)` returning a boolean mask. Both are "more handles on `Expression`" requests but distinct in scope.
- [#49885](https://github.com/apache/arrow/issues/49885): adjacent; binding unresolved Substrait expressions. Complementary work on the Expression API.

### Component(s)

Python

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
