[PR] feat(python/sedonadb): add DataFrame.filter / DataFrame.where [sedona-db]

via GitHub Mon, 11 May 2026 22:45:40 -0700


jiayuasu opened a new pull request, #835:
URL: https://github.com/apache/sedona-db/pull/835


   Wire the `Expr` layer into the existing lazy `DataFrame` so users can filter 
rows without writing SQL predicates.
   
   This is the fourth and final small PR completing Phase P1 of #791, building 
on #807 (Expr foundation), #823 (operators), and #832 (`DataFrame.select`).
   
   ## What's new
   
   ```python
   from sedonadb.expr import col
   
   df.filter(col("x") > 0)
   df.filter(col("x") > 0, col("y") < 10)              # multiple preds → AND
   df.filter((col("x") > 0) & col("y").is_not_null())  # explicit & also fine
   df.where(col("x") > 0)                              # alias of filter
   df.filter(col("x") > 1).filter(col("x") < 4)        # chained: two filter nodes
   ```
   
   Multiple predicates are AND-combined into a single composed `Expr` before 
being handed to DataFusion's `DataFrame::filter`, giving the optimizer one 
conjunction to reason about rather than N stacked filter nodes. Chained 
`.filter().filter()` calls produce two filter nodes in the plan but the same 
result.
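
To illustrate the AND-folding described above, here is a minimal sketch in plain Python. `Pred` and `fold_and` are hypothetical stand-ins invented for this example; they are not sedonadb's `Expr` or the Rust `InternalDataFrame::filter`, and only mimic the shape of the fold (a left-to-right reduction with `&`).

```python
from functools import reduce
import operator


class Pred:
    """Hypothetical stand-in for an expression node (not sedonadb's Expr)."""

    def __init__(self, text):
        self.text = text

    def __and__(self, other):
        # Build a new node representing the conjunction of both sides.
        return Pred(f"({self.text} AND {other.text})")

    def __repr__(self):
        return self.text


def fold_and(*preds):
    """Combine one or more predicates into a single conjunction."""
    if not preds:
        raise ValueError("at least one predicate is required")
    # reduce applies `&` left to right: ((p0 & p1) & p2) & ...
    return reduce(operator.and_, preds)


combined = fold_and(Pred("x > 0"), Pred("y < 10"), Pred("z IS NOT NULL"))
print(combined)  # ((x > 0 AND y < 10) AND z IS NOT NULL)
```

With a single predicate the fold returns it unchanged, so the one-argument and multi-argument cases share one code path.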
   
   ## Decision worth flagging — rejecting `Literal` 🛎️ cc @paleolimbot
   
   `filter()` rejects both `str` and `Literal` arguments at the Python boundary.
   
   - **Strings** are not interpreted as SQL predicates — that's a separate 
`query()`-style feature, not this PR's surface.
   - **`Literal`** is rejected because `filter(lit(True))` (or any 
`filter(lit(value))`) is almost always a mistake: DataFusion would silently accept 
`WHERE 7` as truthy, producing a filter that looks present in the source but is 
actually a no-op. A user who genuinely wants to filter on a boolean value can write 
`col("flag") == lit(True)`, which makes the intent explicit.
   
   This is more restrictive than `select`, which does accept `Literal`. The 
asymmetry is intentional: a literal as a projection is meaningful (a constant 
column), but a literal as a filter is almost never what you want. Happy to 
revisit if you'd rather have `filter(lit(...))` accepted and the resulting 
no-op surface as an optimizer-time warning instead.
   
   ## Implementation
   
   Rust: `InternalDataFrame::filter(Vec<PyExpr>)` — empty list errors at the 
Python boundary; multi-element lists fold via `Expr::and`. Step-by-step 
comments explain the AND-folding and the choice not to stack N filter nodes.
   
   Python: `DataFrame.filter(*exprs: "Expr") -> DataFrame` with explicit 
`isinstance` checks rejecting `str`, `Literal`, and other types. `where = 
filter` as a class-level alias. Type annotation `*exprs: "Expr"` resolves via 
the existing TYPE_CHECKING import block.
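
The boundary checks described above could look roughly like the following sketch. Everything here is an assumption for illustration: `Expr` and `Literal` are empty stand-in classes rather than the sedonadb ones, `Literal` is assumed to subclass `Expr`, `validate_predicates` is a hypothetical helper name, and the error messages are invented, not copied from the PR.

```python
class Expr:
    """Hypothetical stand-in for sedonadb's Expr."""


class Literal(Expr):
    """Hypothetical stand-in; assumed here to subclass Expr."""


def validate_predicates(*exprs):
    """Reject empty input, strings, bare literals, and non-Expr values."""
    if not exprs:
        raise ValueError("filter() requires at least one predicate")
    for e in exprs:
        if isinstance(e, str):
            raise TypeError(
                "filter() does not accept SQL strings; "
                "pass an Expr such as col('x') > 0"
            )
        # Check Literal before Expr: under the subclass assumption,
        # a Literal would otherwise pass the Expr check below.
        if isinstance(e, Literal):
            raise TypeError(
                "filter() does not accept a bare literal; "
                "compare it to a column, e.g. col('flag') == lit(True)"
            )
        if not isinstance(e, Expr):
            raise TypeError(f"filter() expected Expr, got {type(e).__name__}")
    return list(exprs)


checked = validate_predicates(Expr(), Expr())
print(len(checked))  # 2
```

The order of the checks matters only if `Literal` is an `Expr` subclass; if the two types are unrelated in the real code, the final `isinstance(e, Expr)` check would already reject a literal, just with a less specific message.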
   
   ## Test plan
   
   13 tests in `tests/expr/test_dataframe_filter.py`:
   
   - **Positive**: simple predicate, multi-AND, explicit `&`, `|`, `~`, `isin`, 
`where` alias produces identical output, chained `filter().filter()` matches 
multi-AND.
   - **Lazy**: filter returns a `DataFrame` without materializing.
   - **Errors**: empty filter → `ValueError`; string arg → `TypeError`; 
`Literal` arg → `TypeError` with the actionable suggestion; unknown column → 
`SedonaError` whose message includes the valid field names.
   
   All 13 pass locally. Existing select/expression/literal tests all still pass 
(64 total expr tests green). `ruff format` and doctests both clean.
   
   ## What's next
   
   This completes the issue-tracker Phase P1 (Expr layer + select/filter on 
DataFrame). Phase P2 in the [design 
doc](https://github.com/apache/sedona-db/issues/791) covers `Series` / 
`Scalar`, `groupby`, `merge`, the curated `sd.st` module, and the cookbook 
track.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

[PR] feat(python/sedonadb): add DataFrame.filter / DataFrame.where [sedona-db]