jiayuasu opened a new pull request, #835:
URL: https://github.com/apache/sedona-db/pull/835
Wire the `Expr` layer into the existing lazy `DataFrame` so users can filter
rows without writing SQL predicates.
This is the fourth and final small PR completing Phase P1 of #791, building
on #807 (Expr foundation), #823 (operators), and #832 (`DataFrame.select`).
## What's new
```python
from sedonadb.expr import col
df.filter(col("x") > 0)
df.filter(col("x") > 0, col("y") < 10) # multiple preds → AND
df.filter((col("x") > 0) & col("y").is_not_null()) # explicit & also fine
df.where(col("x") > 0) # alias of filter
df.filter(col("x") > 1).filter(col("x") < 4) # chained: two filter nodes
```
Multiple predicates are AND-combined into a single composed `Expr` before
being handed to DataFusion's `DataFrame::filter`, giving the optimizer one
conjunction to reason about rather than N stacked filter nodes. Chained
`.filter().filter()` calls produce two filter nodes in the plan but the same
result.
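For reviewers, a minimal pure-Python sketch of the AND-folding described above. `Pred` and `fold_and` are hypothetical stand-ins invented for illustration; the real fold happens on the Rust side via `Expr::and`, and this only mirrors its shape.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Pred:
    """Hypothetical stand-in for an expression node; not the sedonadb Expr."""

    text: str

    def __and__(self, other: "Pred") -> "Pred":
        # Combine two predicates into one conjunction node.
        return Pred(f"({self.text} AND {other.text})")


def fold_and(*preds: Pred) -> Pred:
    """Left-fold any number of predicates into a single conjunction."""
    if not preds:
        # Mirrors the PR's behaviour of rejecting an empty filter().
        raise ValueError("filter() requires at least one predicate")
    return reduce(lambda acc, p: acc & p, preds)


combined = fold_and(Pred("x > 0"), Pred("y < 10"), Pred("z IS NOT NULL"))
print(combined.text)  # ((x > 0 AND y < 10) AND z IS NOT NULL)
```

A single predicate passes through unchanged, so the one-argument and N-argument cases share the same code path.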
## Decision worth flagging — rejecting `Literal` 🛎️ cc @paleolimbot
`filter()` rejects both `str` and `Literal` arguments at the Python boundary.
- **Strings** are not interpreted as SQL predicates — that's a separate
`query()`-style feature, not this PR's surface.
- **`Literal`** is rejected because `filter(lit(True))` (or any
`filter(lit(value))`) is almost always a typo: DataFusion would silently accept
`WHERE 7` as truthy, producing a filter that appears in the source but is
actually a no-op. If a user genuinely wants a constant predicate, they can write
`col("flag") == lit(True)`, which makes the intent explicit.
This is more restrictive than `select`, which does accept `Literal`. The
asymmetry is intentional: a literal as a projection is meaningful (a constant
column), but a literal as a filter is almost never what you want. Happy to
revisit if you'd rather have `filter(lit(...))` accepted and the resulting
no-op surface as an optimizer-time warning instead.
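To make the rejection rules concrete, here is a hedged sketch of the kind of boundary check described above. The `Expr` and `Literal` classes are local stand-ins, `Literal` is assumed (not confirmed by this PR text) to subclass `Expr`, and `check_filter_args` and its error messages are hypothetical, not the actual implementation.

```python
class Expr:
    """Hypothetical stand-in for the expression type."""


class Literal(Expr):
    """Hypothetical stand-in; assumed here to be an Expr subclass."""


def check_filter_args(*exprs: object) -> tuple:
    """Validate filter() arguments before they reach the engine."""
    if not exprs:
        raise ValueError("filter() requires at least one predicate")
    for e in exprs:
        if isinstance(e, str):
            raise TypeError(
                "filter() does not accept SQL strings; "
                "pass an Expr such as col('x') > 0"
            )
        # Literal is checked before Expr because, under the assumption above,
        # a Literal would otherwise pass the generic Expr check.
        if isinstance(e, Literal):
            raise TypeError(
                "filter() does not accept a bare literal; "
                "compare a column instead, e.g. col('flag') == lit(True)"
            )
        if not isinstance(e, Expr):
            raise TypeError(
                f"filter() expects Expr arguments, got {type(e).__name__}"
            )
    return exprs
```

The order of the checks matters only if `Literal` really is an `Expr` subclass; if it is a separate type, the final `isinstance(e, Expr)` check would catch it anyway, just with a less actionable message.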
## Implementation
Rust: `InternalDataFrame::filter(Vec<PyExpr>)` — empty list errors at the
Python boundary; multi-element lists fold via `Expr::and`. Step-by-step
comments explain the AND-folding and the choice not to stack N filter nodes.
Python: `DataFrame.filter(*exprs: "Expr") -> DataFrame` with explicit
`isinstance` checks rejecting `str`, `Literal`, and other types. `where =
filter` as a class-level alias. Type annotation `*exprs: "Expr"` resolves via
the existing TYPE_CHECKING import block.
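A small sketch of the class-level alias pattern, for anyone unfamiliar with it. `LazyFrame` is a hypothetical toy class that only records predicates; it is not the sedonadb `DataFrame` and performs no evaluation.

```python
class LazyFrame:
    """Hypothetical toy frame that records predicates without evaluating them."""

    def __init__(self, predicates=()):
        self._predicates = tuple(predicates)

    def filter(self, *exprs):
        # Return a new frame carrying the extra predicates; nothing runs here.
        return LazyFrame(self._predicates + exprs)

    # Class-level alias: both names are bound to the same function object.
    where = filter


pred = ("x", ">", 0)  # opaque placeholder for an expression
df = LazyFrame()
same_function = LazyFrame.where is LazyFrame.filter
same_result = df.where(pred)._predicates == df.filter(pred)._predicates
```

One property of this pattern worth knowing: a subclass that overrides `filter` does not automatically update `where`, which stays bound to the original function unless the subclass re-declares the alias.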
## Test plan
13 tests in `tests/expr/test_dataframe_filter.py`:
- **Positive**: simple predicate, multi-AND, explicit `&`, `|`, `~`, `isin`,
`where` alias produces identical output, chained `filter().filter()` matches
multi-AND.
- **Lazy**: filter returns a `DataFrame` without materializing.
- **Errors**: empty filter → `ValueError`; string arg → `TypeError`;
`Literal` arg → `TypeError` with the actionable suggestion; unknown column →
`SedonaError` whose message includes the valid field names.
All 13 pass locally. Existing select/expression/literal tests all still pass
(64 total expr tests green). `ruff format` and doctests both clean.
## What's next
This completes the issue-tracker Phase P1 (Expr layer + select/filter on
DataFrame). Phase P2 in the [design
doc](https://github.com/apache/sedona-db/issues/791) covers `Series` /
`Scalar`, `groupby`, `merge`, the curated `sd.st` module, and the cookbook
track.