paultmathew opened a new issue, #50031:
URL: https://github.com/apache/arrow/issues/50031

   ### Describe the enhancement requested
   
   `pyarrow.compute.Expression` has no Python-accessible way to enumerate the
   fields it references. The C++ side already exposes the underlying primitive
   ([`arrow::compute::FieldsInExpression`](https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/expression.h#L174)),
   but the Python `Expression` class only surfaces `cast`, `equals`, `is_null`,
   `is_nan`, `is_valid`, `isin`, and Substrait round-trip. Every downstream tool
   that needs the column set of a predicate today either:
   
   1. Regex-parses `str(expression)` (fragile — quoted string literals and
      keywords like `and` leak into the result).
   2. Serializes to Substrait via `to_substrait(schema)` and walks the protobuf
      (heavy — requires a bound schema and a substrait dependency just to ask
      "which columns?").
   3. Maintains a parallel AST upstream of `pc.Expression`, like
      [Ray Data's `_PyArrowExpressionVisitor`](https://docs.ray.io/en/releases-2.54.1/_modules/ray/data/expressions.html).
   
   Exposing the existing C++ primitive removes all three workarounds.
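
To illustrate why workaround 1 is fragile, here is a minimal sketch. The string is an assumed rendering of `str(expression)` for `(pc.field("a") > 0) & (pc.field("name") == "b and c")`, hard-coded so the snippet runs without pyarrow installed. `naive_field_names` is a hypothetical helper, not an existing API.

```python
import re

# Assumed str() rendering of
#   (pc.field("a") > 0) & (pc.field("name") == "b and c")
# hard-coded here so the example has no pyarrow dependency.
rendered = '((a > 0) and (name == "b and c"))'


def naive_field_names(text: str) -> list[str]:
    """Workaround 1: treat every identifier-like token as a column name."""
    return re.findall(r"[A-Za-z_]\w*", text)


tokens = naive_field_names(rendered)
# Only "a" and "name" are real field references; everything else is noise
# from the `and` keyword and from the contents of the string literal.
noise = [t for t in tokens if t not in {"a", "name"}]
print(tokens)
print(noise)
```

Filtering out keywords and quoted literals by hand is possible, but it amounts to re-implementing a parser for a string format that is not a stable contract.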
   
   ### Motivating use cases
   
   The recurring shape is: a library or end user has a `pc.Expression` in hand
   and needs to decide **which columns to read off disk** before evaluating it.
   
   1. **Column projection on cold storage.** Wrapping `pyarrow.dataset.Scanner`
      or `pyiceberg.Table.scan(...)` with a user-supplied filter — the wrapper
      wants to set `selected_fields = user_projection ∪ filter_refs` to avoid
      pulling unused columns off S3 / disk.
   2. **Conditional MERGE / upsert on Iceberg.** PyIceberg's `Table.upsert`
      currently has no `when_matched_condition` parameter
      ([apache/iceberg-python#1534](https://github.com/apache/iceberg-python/pull/1534)
      explicitly scoped to "when matched update all / when not matched insert
      all" and directed users to Spark for predicate-based MERGE). Implementing
      a conditional upsert in Python requires projecting only the destination
      columns the predicate touches before joining and filtering — which needs
      field-ref introspection.
   3. **Predicate splitting across two sources.** Any library that accepts a
      single user-facing predicate and routes it across a join (source ↔ target,
      stream ↔ table, etc.) needs to bucket field references by side.
   4. **Ray Data, delta-rs, Lance.** Cross-engine routers that translate
      `pc.Expression` to a non-Arrow execution engine all start with the same
      question — which fields does it touch? — to decide which engine knows
      about which columns and which side of a join to push the filter on.
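
As a rough sketch of use cases 1 and 3, the snippet below assumes the references arrive in the shape proposed in this issue (`str`, `int`, or `tuple`). The helper names `top_level`, `columns_to_read`, and `split_by_side` are hypothetical and only illustrate what a caller might do with the result.

```python
from typing import Iterable, Union

# Shape of one reference as proposed in this issue (an assumption, not an existing API).
Ref = Union[str, int, tuple]


def top_level(ref: Ref):
    """Return the outermost column of a reference; nested refs keep only their root."""
    return ref[0] if isinstance(ref, tuple) else ref


def columns_to_read(user_projection: Iterable[str], filter_refs: Iterable[Ref]) -> list:
    """Union of the user projection and the filter's columns, order-preserving."""
    roots = [top_level(ref) for ref in filter_refs]
    # dict.fromkeys de-duplicates while keeping first-seen order.
    return list(dict.fromkeys([*user_projection, *roots]))


def split_by_side(filter_refs: Iterable[Ref], source_columns: set, target_columns: set) -> dict:
    """Bucket references by which side of a join owns their root column."""
    buckets = {"source": [], "target": [], "unknown": []}
    for ref in filter_refs:
        root = top_level(ref)
        if root in source_columns:
            buckets["source"].append(ref)
        elif root in target_columns:
            buckets["target"].append(ref)
        else:
            buckets["unknown"].append(ref)
    return buckets


refs = ["amount", ("user", "city"), "ts"]
selected = columns_to_read(["id", "amount"], refs)
sides = split_by_side(refs, source_columns={"amount", "ts"}, target_columns={"user"})
print(selected)
print(sides)
```

By-index references (`int`) pass through unchanged here; resolving them to column names would need the schema, which this sketch leaves out.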
   
   ### Prior discussion
   
   Comment thread on the closed
   [#27160 [Python] Allow to create field reference to nested field](https://github.com/apache/arrow/issues/27160)
   records this as a known gap that was never tracked:
   
   > bkietz:
   > "currently field_refs can only extract a field from the scanned dataset.
   > It'd be helpful if they could also extract a field from an Expression."
   >
   > nealrichardson:
   > "Agree that it would be helpful (possibly necessary) to be able to extract
   > a field from an Expression more generally."
   
   That thread closed on the inverse direction (constructing nested refs);
   this issue tracks the missing direction.
   
   ### Proposed API
   
   ```python
   def field_refs(self) -> list[str | int | tuple[str | int, ...]]:
       """
       Return the field references contained in this expression.

       Each reference is reported once per occurrence, without
       deduplication (matches the C++ `FieldsInExpression` semantics).
       The returned value shape mirrors the input of
       `pyarrow.compute.field()`: by-name references come back as `str`,
       by-index as `int`, and nested references as `tuple`.
       """
   ```
   
   Round-trip example:
   
   ```python
   >>> import pyarrow.compute as pc
   >>> ((pc.field("a") > 0) & pc.field("b").is_null()).field_refs()
   ['a', 'b']
   >>> pc.field("user", "city").field_refs()
   [('user', 'city')]
   >>> pc.scalar(5).field_refs()
   []
   ```
   
   ### Open API decisions to settle before implementation
   
   | Decision | Proposed | Rationale |
   |---|---|---|
   | Method name | `field_refs()` | Mirrors C++ free function `FieldsInExpression` and the existing singular accessor `Expression.field_ref()`. Alternatives: `references()`, `referenced_fields()`. |
   | Return type | `list[str \| int \| tuple]` | Round-trip compatible with `pc.field(ref)`, which accepts a `str`, an `int`, or a `tuple`. Avoids introducing a new public `FieldRef` Python type (which would deserve its own design discussion, likely as a follow-up). |
   | Dedup | No | Matches C++ `FieldsInExpression`. Callers do `set(...)` if desired. |
   | Order | Traversal (left-to-right, depth-first) | Documented as "not part of the public contract" to leave room for implementation changes. |
   | Single-element FieldPath | Plain `int`, not `(int,)` | Symmetric with `pc.field(3)` returning a non-nested ref. |
   
   Happy to defer any of these to maintainer preference.
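
To make the proposed decisions concrete, here is a toy model in pure Python. It is not how pyarrow represents expressions: `Field`, `Scalar`, and `Call` are hypothetical stand-ins, used only to show the no-dedup, left-to-right depth-first, single-element-collapsing behaviour described in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    path: tuple  # e.g. ("a",), (3,), or ("user", "city")


@dataclass(frozen=True)
class Scalar:
    value: object


@dataclass(frozen=True)
class Call:
    function: str
    args: tuple


def field_refs(expr) -> list:
    """Collect references depth-first, left to right, without de-duplication."""
    if isinstance(expr, Field):
        # A single-element path collapses to a plain str or int.
        return [expr.path[0] if len(expr.path) == 1 else expr.path]
    if isinstance(expr, Scalar):
        return []
    refs = []
    for arg in expr.args:
        refs.extend(field_refs(arg))
    return refs


# Toy equivalent of: (a > 0) and (a < user.city)
expr = Call("and", (
    Call("greater", (Field(("a",)), Scalar(0))),
    Call("less", (Field(("a",)), Field(("user", "city")))),
))
refs = field_refs(expr)
print(refs)             # 'a' appears twice: once per occurrence
print(sorted(set(r for r in refs if isinstance(r, str))))
```

Callers who want unique references can de-duplicate on their side, which keeps the method itself a thin wrapper.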
   
   ### Implementation outline
   
   Small (~80 lines including tests). Three files touched:
   
   - `python/pyarrow/includes/libarrow.pxd` — declare `FieldsInExpression` and
     the additional `CFieldRef` accessors (`IsName`, `IsFieldPath`, `IsNested`,
     `field_path`, `nested_refs`) needed for the conversion helper.
   - `python/pyarrow/_compute.pyx` — add a `_fieldref_to_python` helper and a
     `field_refs()` method on `Expression`. Both small.
   - `python/pyarrow/tests/test_compute.py` — coverage for the four FieldRef
     shapes (name / index / nested name / nested index), empty (constant
     expression), and round-trip through `pc.field()`.
   
   Plus one autosummary line in `docs/source/python/api/compute.rst`.
   
   I'm happy to put up a PR once the API is agreed.
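
The conversion helper mentioned in the outline could follow roughly the logic below. This is a pure-Python sketch under stated assumptions: `FakeFieldRef` is a hypothetical stand-in for the C++ `FieldRef`, with attributes playing the role of the `IsName` / `IsFieldPath` / `IsNested` checks. The real helper would be Cython code operating on `CFieldRef`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FakeFieldRef:
    """Hypothetical stand-in for a C++ FieldRef; at most one attribute is set."""
    name: Optional[str] = None   # by-name reference
    indices: tuple = ()          # by-index reference (field path)
    children: tuple = ()         # nested reference


def fieldref_to_python(ref: FakeFieldRef):
    """Convert a reference to str, int, or a flat tuple of str/int."""
    if ref.name is not None:                     # plays the role of IsName()
        return ref.name
    if ref.indices:                              # plays the role of IsFieldPath()
        if len(ref.indices) == 1:
            return ref.indices[0]                # plain int, not (int,)
        return tuple(ref.indices)
    parts = []                                   # plays the role of IsNested()
    for child in ref.children:
        converted = fieldref_to_python(child)
        # Flatten nested results so the output stays a single flat tuple.
        parts.extend(converted if isinstance(converted, tuple) else (converted,))
    return tuple(parts)


# The four shapes listed in the test plan.
by_name = fieldref_to_python(FakeFieldRef(name="a"))
by_index = fieldref_to_python(FakeFieldRef(indices=(3,)))
nested_name = fieldref_to_python(
    FakeFieldRef(children=(FakeFieldRef(name="user"), FakeFieldRef(name="city"))))
nested_index = fieldref_to_python(
    FakeFieldRef(children=(FakeFieldRef(name="a"), FakeFieldRef(indices=(0,)))))
print(by_name, by_index, nested_name, nested_index)
```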
   
   ### Related issues
   
   - [#27160](https://github.com/apache/arrow/issues/27160) — closed; this
     issue captures the unfiled follow-up.
   - [#34433](https://github.com/apache/arrow/issues/34433) — adjacent;
     asks for `table.evaluate(expr)` returning a boolean mask. Both are
     "more handles on `Expression`" requests but distinct in scope.
   - [#49885](https://github.com/apache/arrow/issues/49885) — adjacent;
     binding unresolved Substrait expressions. Complementary work on the
     Expression API.
   
   ### Component(s)
   
   Python

