jiayuasu opened a new issue, #791:
URL: https://github.com/apache/sedona-db/issues/791
## Motivation
SedonaDB's Python package today is SQL-driven: users get a lazy `DataFrame`
from `sd.sql(...)`, `sd.read_parquet(...)`, or `sd.create_data_frame(...)`, and
the only way to transform it is to write another SQL string. That's correct and
composable but unfamiliar to the target audience — Python data scientists who
already use pandas and GeoPandas and expect chainable
`df.filter(...).select(...).group_by(...)` style code.
This issue tracks adding:
1. A chainable Python DataFrame API that compiles to DataFusion logical
plans, not to SQL strings.
2. A curated set of Python wrappers for spatial functions so users don't
have to write `sd.sql("SELECT ST_Buffer(...)")` for common operations.
3. A thin GeoPandas-flavored surface — `.geometry`, `.crs`, `.to_crs()`,
`.sjoin()` — covering the common GeoPandas workflows.
## Current state (baseline)
Relevant files:
- `python/sedonadb/python/sedonadb/context.py` — `SedonaContext` (`sql`,
`read_parquet`, `read_pyogrio`, `create_data_frame`, `view`, `register_udf`,
`funcs`).
- `python/sedonadb/python/sedonadb/dataframe.py` — `DataFrame` (`limit`,
`head`, `execute`, `count`, `show`, `to_arrow_table`, `to_pandas`,
`to_parquet`, `to_pyogrio`, `to_view`, `with_params`, `schema`, `columns`,
`explain`).
- `python/sedonadb/src/dataframe.rs` — `InternalDataFrame` PyO3 wrapper
around `datafusion::prelude::DataFrame`.
- `python/sedonadb/python/sedonadb/expr/literal.py` — `Literal` / `lit()`,
the only existing expression type.
- `python/sedonadb/python/sedonadb/functions/__init__.py` — `Functions`
accessor; today only exposes table functions.
Working in our favor:
- The Rust side already holds a full DataFusion `DataFrame`, which supports
`select`, `filter`, `with_column`, `join`, `sort`, `aggregate`, etc. We are
exposing existing capability, not adding it.
- Schema already tracks geometry columns and CRS.
- `to_pandas()` already returns a `GeoDataFrame` when a geometry column is
present. GeoParquet I/O and `pyogrio` write are done.
## Design decisions
| # | Decision |
|---|---|
| 1 | Model `Expr` and DataFrame ops on **datafusion-python**: mirrors the Rust core; same plumbing, same semantics. |
| 2 | **Same `DataFrame` class** for tabular and geospatial; active geometry tracked via schema metadata, not a subclass. |
| 3 | Ship a **curated** subset of spatial function wrappers, not all of them. Provide `sd.st.call("ST_Whatever", ...)` as escape hatch. |
| 4 | **No `GeoSeries` type in v1**. A single-column `DataFrame` with `.geometry` pointing at that column is the geometry-series representation. |
## Scope of v1
In scope:
- `Expr` type with operator overloads (`+`, `-`, `*`, `/`, `==`, `!=`, `<`,
`<=`, `>`, `>=`, `&`, `|`, `~`).
- `col()`, plus the curated spatial functions below. Reuse existing `lit()`.
- Chainable `DataFrame` ops: `select`, `filter`/`where`, `with_column`,
`drop`, `rename`, `sort_values`, `distinct`, `union`, `join`/`merge`,
`group_by().agg()`, plus existing `limit`/`head`.
- Pandas ergonomics on `DataFrame`: `__getitem__` (column / projection /
filter), `__setitem__` as sugar for `with_column`, `dtypes`.
- GeoPandas-flavored accessors: `.geometry`, `.crs`,
`.active_geometry_name`, `.set_geometry(col)`, `.to_crs(target)`,
`.sjoin(other, predicate, how)`.
- Tests + a docs page of GeoPandas → SedonaDB side-by-side recipes.
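
One consequence of overloading `==` on `Expr` is worth seeing concretely, because Python's own containers rely on `__eq__` and `__hash__`. The sketch below uses a toy `Expr` (hypothetical, not the proposed class), and `same_as` is an invented name for one possible escape hatch:

```python
class Expr:
    """Toy expression (hypothetical) whose `==` builds a new expression."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        # Return an expression instead of a bool, as a DataFrame DSL would.
        other_text = other.text if isinstance(other, Expr) else repr(other)
        return Expr(f"({self.text} = {other_text})")

    def same_as(self, other):
        # Hypothetical escape hatch: structural equality under an explicit name.
        return isinstance(other, Expr) and self.text == other.text


a, b = Expr("a"), Expr("b")

# Defining __eq__ without __hash__ sets __hash__ to None, so instances
# cannot be set members or dict keys.
try:
    {a}
    hashable = True
except TypeError:
    hashable = False

# `in` falls back to `==`; the resulting Expr is truthy, so this reports
# a match even though `a` and `b` are different columns.
found = b in [a]
```

Making `__bool__` raise, as pandas does for `Series`, would turn the silent false positive in the last line into an explicit error.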
Curated `sd.st` v1 list (42 functions):
- **Constructors**: `point`, `geom_from_wkt`, `geom_from_wkb`,
`geog_from_wkt`, `geog_from_wkb`, `make_line`.
- **Accessors**: `area`, `length`, `perimeter`, `centroid`, `envelope`,
`npoints`, `x`, `y`, `z`, `m`, `srid`, `is_valid`, `has_z`.
- **Predicates**: `intersects`, `contains`, `within`, `covers`,
`covered_by`, `touches`, `crosses`, `overlaps`, `dwithin`, `equals`.
- **Constructive**: `buffer`, `union`, `intersection`, `difference`,
`convex_hull`, `concave_hull`, `simplify`.
- **Transforms**: `transform`, `set_crs`, `flip_coordinates`.
- **Output**: `as_text`, `as_binary`, `as_geojson`.
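
As a rough sketch of how the curated wrappers and the `call` escape hatch could relate, the toy below treats each wrapper as a one-liner over `call`. `SpatialFunctions`, `FunctionCall`, `to_sql`, the hard-coded name set, and the wrapper-to-SQL-name mapping are illustrative assumptions; in the proposal the wrappers return expressions resolved against the context's UDF registry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionCall:
    """Toy stand-in for an expression node that calls a named function."""
    name: str
    args: tuple

    def to_sql(self):
        rendered = ", ".join(
            a.to_sql() if isinstance(a, FunctionCall) else str(a) for a in self.args
        )
        return f"{self.name}({rendered})"


class SpatialFunctions:
    """Hypothetical stand-in for the proposed `sd.st` namespace."""

    def __init__(self, registered):
        # A real implementation would consult the context's UDF registry
        # rather than a fixed set of names.
        self._registered = {name.lower() for name in registered}

    def call(self, name, *args):
        # Escape hatch: reach any registered function by its SQL name.
        if name.lower() not in self._registered:
            raise ValueError(f"unknown function: {name}")
        return FunctionCall(name, args)

    # Curated wrappers are thin, discoverable aliases over `call`.
    def buffer(self, geom, distance):
        return self.call("ST_Buffer", geom, distance)

    def intersects(self, left, right):
        return self.call("ST_Intersects", left, right)


st = SpatialFunctions(["ST_Buffer", "ST_Intersects", "ST_Whatever"])
curated = st.buffer("geom", 10).to_sql()
escaped = st.call("ST_Whatever", "geom").to_sql()
```

Keeping the wrappers this thin means a function missing from the curated list costs users one `call(...)` rather than a fallback to SQL strings.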
## Milestones
- **M1** — Expression layer (`Expr`, `col`, operators). Core ops: `select`,
`filter`, `with_column`, `drop`, `rename`, `sort_values`. Pandas `__getitem__`
/ `__setitem__`. Tests.
- **M2** — `join`/`merge`, `group_by().agg()`, `distinct`, `union`. Tests.
- **M3** — `sd.st` curated module. Tests + doc examples.
- **M4** — GeoPandas facade: `.geometry`, `.crs`, `.set_geometry`,
`.to_crs`, `.sjoin`. Migration recipes doc.
- **M5** — Polish: `NotImplementedError` stubs for unsupported pandas
surface, `dtypes`, release notes entry, migration guide.
Each milestone is additive to the existing Python package — no feature flag
needed because the SQL surface is untouched.
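
The pandas-style `__getitem__` planned for M1 has three forms (column, projection, filter). One way to dispatch on the key type is sketched below with a toy `Frame` that only records operations. What `df["a"]` returns (here, an expression) is an assumption, as are all class and attribute names.

```python
class Expr:
    """Toy expression (hypothetical); stands in for the proposed `Expr` type."""

    def __init__(self, text):
        self.text = text

    def __gt__(self, other):
        return Expr(f"({self.text} > {other!r})")


class Frame:
    """Toy lazy frame (hypothetical) that records operations instead of running them."""

    def __init__(self, columns, ops=()):
        self.columns = list(columns)
        self.ops = tuple(ops)

    def select(self, *names):
        missing = [n for n in names if n not in self.columns]
        if missing:
            raise KeyError(missing)
        return Frame(names, self.ops + (("select", names),))

    def filter(self, predicate):
        return Frame(self.columns, self.ops + (("filter", predicate.text),))

    def __getitem__(self, key):
        if isinstance(key, str):            # df["a"] -> column reference
            if key not in self.columns:
                raise KeyError(key)
            return Expr(key)
        if isinstance(key, Expr):           # df[predicate] -> filter
            return self.filter(key)
        if isinstance(key, (list, tuple)):  # df[["a", "b"]] -> projection
            return self.select(*key)
        raise TypeError(f"unsupported key type: {type(key).__name__}")


df = Frame(["id", "area", "geom"])
big = df[df["area"] > 100][["id", "geom"]]
```

Here `big.ops` holds the filter step followed by the projection step, and `df` itself is unchanged.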
## Implementation notes
### Rust side (PyO3)
Add to `python/sedonadb/src/`:
- `expr.rs` — `InternalExpr` holding `datafusion_expr::Expr`. Methods:
`col(name)`, constructors from Python literals, operator wrappers, `alias`,
`cast`, `is_null`, etc.
- `functions.rs` — `scalar_udf_call(name, args)` that looks up a UDF in the
context's registry and returns an `InternalExpr`. This is what the Python
`sd.st.*` wrappers call into.
- Extend `InternalDataFrame` with: `select(exprs)`, `filter(expr)`,
`with_column(name, expr)`, `drop(cols)`, `rename(map)`, `sort(exprs)`,
`distinct()`, `union(other, distinct)`, `join(other, left_on, right_on, how)`,
`aggregate(group_exprs, agg_exprs)`.
Each new method is a thin wrapper over the corresponding DataFusion
`DataFrame` method — no new query-engine code.
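
On the Python side, the operator wrappers described for `expr.rs` would surface as dunder methods. The engine-free sketch below shows the shape of that layer: the toy `Expr` builds a printable tree, whereas the proposed class would delegate each operator to `InternalExpr`. The class layout and the `_as_expr` helper are assumptions made for illustration.

```python
class Expr:
    """Toy expression node (hypothetical); a real one would wrap InternalExpr."""

    def __init__(self, op, *args):
        self.op = op
        self.args = args

    def _binary(self, op, other):
        # Plain Python values on the right-hand side are promoted to literals.
        return Expr(op, self, _as_expr(other))

    def __add__(self, other): return self._binary("+", other)
    def __sub__(self, other): return self._binary("-", other)
    def __mul__(self, other): return self._binary("*", other)
    def __truediv__(self, other): return self._binary("/", other)
    def __eq__(self, other): return self._binary("==", other)
    def __ne__(self, other): return self._binary("!=", other)
    def __lt__(self, other): return self._binary("<", other)
    def __le__(self, other): return self._binary("<=", other)
    def __gt__(self, other): return self._binary(">", other)
    def __ge__(self, other): return self._binary(">=", other)
    def __and__(self, other): return self._binary("AND", other)
    def __or__(self, other): return self._binary("OR", other)
    def __invert__(self): return Expr("NOT", self)

    def __repr__(self):
        if self.op == "col":
            return self.args[0]
        if self.op == "lit":
            return repr(self.args[0])
        if self.op == "NOT":
            return f"(NOT {self.args[0]!r})"
        left, right = self.args
        return f"({left!r} {self.op} {right!r})"


def col(name):
    return Expr("col", name)


def lit(value):
    return Expr("lit", value)


def _as_expr(value):
    return value if isinstance(value, Expr) else lit(value)


predicate = (col("area") > 100) & ~(col("kind") == "road")
```

The parentheses around each comparison are required: in Python `&`, `|`, and `~` bind more tightly than `>` and `==`, which is worth calling out in the docs.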
### `__setitem__` semantics
`df["x"] = expr` is ambiguous in an immutable model. We keep the immutable model Polars uses and treat the assignment as a shortcut for
`df = df.with_column("x", expr)`; we document clearly that only `df` itself ends up
referring to the new frame and that no shared state is mutated.
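
A minimal sketch of this contract, assuming a toy `Frame` whose plan is an immutable tuple (all names here are hypothetical). Python's `__setitem__` cannot rebind the caller's variable, so the sketch swaps the plan held by the wrapper object; frames derived earlier keep their own plan.

```python
class Frame:
    """Toy lazy frame (hypothetical): holds only an immutable tuple of plan steps."""

    def __init__(self, plan=()):
        self._plan = tuple(plan)

    def with_column(self, name, expr):
        # Pure operation: returns a new frame and leaves `self` untouched.
        return Frame(self._plan + ((name, expr),))

    def __setitem__(self, name, expr):
        # Sugar for `df = df.with_column(name, expr)`: point this wrapper at
        # the new plan. No existing plan tuple is modified in place.
        self._plan = self.with_column(name, expr)._plan

    @property
    def columns(self):
        return [name for name, _ in self._plan]


base = Frame().with_column("a", "1")
derived = base.with_column("b", "a + 1")
base["c"] = "a * 2"
```

After the assignment `base.columns` includes `c`, while `derived` still sees only `a` and `b`. An alias such as `other = base` refers to the same wrapper and would also see `c`, which is the part of the contract the documentation needs to spell out.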
### Active geometry resolution
Already exposed: `primary_geometry_column` on `InternalDataFrame`. Python
`.geometry`, `.crs`, `.to_crs`, `.sjoin` all resolve through that.
`set_geometry(col)` produces a new DataFrame whose schema metadata marks `col`
as the primary geometry.
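
The resolution rule can be sketched without the engine. In the toy below, `Field` and `Frame` are hypothetical stand-ins; only the `primary_geometry_column` name comes from the existing code. `set_geometry` returns a new frame that differs only in metadata, and `.crs` resolves through the primary geometry.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Field:
    name: str
    is_geometry: bool = False
    crs: Optional[str] = None


@dataclass(frozen=True)
class Frame:
    """Toy frame (hypothetical): a schema plus the name of the primary geometry."""
    fields: tuple
    primary_geometry_column: Optional[str] = None

    def _field(self, name):
        for f in self.fields:
            if f.name == name:
                return f
        raise KeyError(name)

    def set_geometry(self, col):
        if not self._field(col).is_geometry:
            raise TypeError(f"column {col!r} is not a geometry column")
        # New frame; only the metadata changes, the original is untouched.
        return replace(self, primary_geometry_column=col)

    @property
    def crs(self):
        # Resolved through the primary geometry column.
        if self.primary_geometry_column is None:
            raise AttributeError("no active geometry column")
        return self._field(self.primary_geometry_column).crs


df = Frame(
    fields=(
        Field("id"),
        Field("geom", True, "EPSG:4326"),
        Field("centroid", True, "EPSG:3857"),
    ),
    primary_geometry_column="geom",
)
df2 = df.set_geometry("centroid")
```

Whether the real metadata survives every DataFusion plan transformation is exactly the open question listed under risks; this sketch only illustrates the intended Python-level behavior.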
### Error messages
When a user calls a pandas-only method we explicitly do not support (`.loc`,
`.iloc`, `.reset_index`, `.apply`, `.set_index`), raise `NotImplementedError`
with a one-line pointer to the migration guide section.
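
One possible shape for these stubs, assuming a `__getattr__` hook and made-up migration guide section names (neither is specified in this issue):

```python
class DataFrame:
    # Pandas-only names mapped to a hypothetical migration guide section.
    _UNSUPPORTED = {
        "loc": "indexing",
        "iloc": "indexing",
        "reset_index": "indexes",
        "set_index": "indexes",
        "apply": "row-wise functions",
    }

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so supported
        # methods and attributes are unaffected.
        section = self._UNSUPPORTED.get(name)
        if section is not None:
            raise NotImplementedError(
                f"DataFrame.{name} is not supported; "
                f"see the migration guide, section '{section}'."
            )
        raise AttributeError(
            f"{type(self).__name__!r} object has no attribute {name!r}"
        )


try:
    DataFrame().loc
    message = None
except NotImplementedError as exc:
    message = str(exc)
```

A hook like this covers both method-style (`.apply(...)`) and accessor-style (`.loc[...]`) usage. One trade-off: `hasattr(df, "loc")` would raise `NotImplementedError` rather than return `False`, since `hasattr` only swallows `AttributeError`.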
## Risks & open questions
- **Lazy-evaluation surprises**. Users coming from pandas expect every
operation to materialize its result; our frames stay lazy until an explicit
call such as `.show()` or `.to_pandas()`. Mitigated by keeping those calls
ergonomic and pointing to them in error messages and docs.
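- **`__eq__` overload collides with dict / set usage**. Once `==` returns an
`Expr` instead of a `bool`, equality-based lookups (dict keys, set membership,
`in`) stop behaving as expected. This is the standard trade-off in DataFrame
libraries (Polars, PySpark, Ibis). Document it; provide an escape hatch if needed.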
- **`set_geometry` via field metadata** needs to round-trip through
DataFusion's plan layer. If DataFusion strips custom metadata on `select *`,
we'll need a small shim. Worth a spike before M4.
- **`__setitem__` semantics** — revisit if user feedback on M1 shows the
"returns a new frame but looks mutating" contract is too surprising.
## Non-goals (explicit)
Documented so future contributors don't re-propose these:
- Row labels, `Index`, `RangeIndex`, `MultiIndex`.
- `.loc`, `.iloc`, `.at`, `.iat`.
- `.apply`, `.applymap`, `.map` over rows.
- `GeoSeries` as a type in v1.
- `GeoDataFrame` as a subclass.
- `.plot()` — users go through `to_pandas()`.
- Mutable in-place operations (`df.sort_values(..., inplace=True)`).