jiayuasu opened a new issue, #791:
URL: https://github.com/apache/sedona-db/issues/791

   ## Motivation
   
   SedonaDB's Python package today is SQL-driven: users get a lazy `DataFrame` 
from `sd.sql(...)`, `sd.read_parquet(...)`, or `sd.create_data_frame(...)`, and 
the only way to transform it is to write another SQL string. That's correct and 
composable but unfamiliar to the target audience — Python data scientists who 
already use pandas and GeoPandas and expect chainable 
`df.filter(...).select(...).group_by(...)` style code.
   
   This issue tracks adding:
   
   1. A chainable Python DataFrame API that compiles to DataFusion logical 
plans, not to SQL strings.
   2. A curated set of Python wrappers for spatial functions so users don't 
have to write `sd.sql("SELECT ST_Buffer(...)")` for common operations.
   3. A thin GeoPandas-flavored surface — `.geometry`, `.crs`, `.to_crs()`, 
`.sjoin()` — covering the common GeoPandas workflows.
   
   ## Current state (baseline)
   
   Relevant files:
   
   - `python/sedonadb/python/sedonadb/context.py` — `SedonaContext` (`sql`, 
`read_parquet`, `read_pyogrio`, `create_data_frame`, `view`, `register_udf`, 
`funcs`).
   - `python/sedonadb/python/sedonadb/dataframe.py` — `DataFrame` (`limit`, 
`head`, `execute`, `count`, `show`, `to_arrow_table`, `to_pandas`, 
`to_parquet`, `to_pyogrio`, `to_view`, `with_params`, `schema`, `columns`, 
`explain`).
   - `python/sedonadb/src/dataframe.rs` — `InternalDataFrame` PyO3 wrapper 
around `datafusion::prelude::DataFrame`.
   - `python/sedonadb/python/sedonadb/expr/literal.py` — `Literal` / `lit()`, 
the only existing expression type.
   - `python/sedonadb/python/sedonadb/functions/__init__.py` — `Functions` 
accessor; today only exposes table functions.
   
   Working in our favor:
   
   - The Rust side already holds a full DataFusion `DataFrame`, which supports 
`select`, `filter`, `with_column`, `join`, `sort`, `aggregate`, etc. We are 
exposing existing capability, not adding it.
   - Schema already tracks geometry columns and CRS.
   - `to_pandas()` already returns a `GeoDataFrame` when a geometry column is 
present. GeoParquet I/O and `pyogrio` write are done.
   
   ## Design decisions
   
   | # | Decision |
   |---|---|
   | 1 | Model `Expr` and DataFrame ops on **datafusion-python** — mirrors the 
Rust core; same plumbing, same semantics. |
   | 2 | **Same `DataFrame` class** for tabular and geospatial; active geometry 
tracked via schema metadata, not a subclass. |
   | 3 | Ship a **curated** subset of spatial function wrappers, not all of 
them. Provide `sd.st.call("ST_Whatever", ...)` as an escape hatch. |
   | 4 | **No `GeoSeries` type in v1**. A single-column `DataFrame` with 
`.geometry` pointing at that column is the geometry-series representation. |
   
   ## Scope of v1
   
   In scope:
   
   - `Expr` type with operator overloads (`+`, `-`, `*`, `/`, `==`, `!=`, `<`, 
`<=`, `>`, `>=`, `&`, `|`, `~`).
   - `col()`, plus the curated spatial functions below. Reuse existing `lit()`.
   - Chainable `DataFrame` ops: `select`, `filter`/`where`, `with_column`, 
`drop`, `rename`, `sort_values`, `distinct`, `union`, `join`/`merge`, 
`group_by().agg()`, plus existing `limit`/`head`.
   - Pandas ergonomics on `DataFrame`: `__getitem__` (column / projection / 
filter), `__setitem__` as sugar for `with_column`, `dtypes`.
   - GeoPandas-flavored accessors: `.geometry`, `.crs`, 
`.active_geometry_name`, `.set_geometry(col)`, `.to_crs(target)`, 
`.sjoin(other, predicate, how)`.
   - Tests + a docs page of GeoPandas → SedonaDB side-by-side recipes.
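   
   To make the operator-overload item concrete, here is a minimal, self-contained sketch of how an `Expr` can build an expression tree instead of evaluating anything. It is a pure-Python toy: the `Expr` fields and the `AND` / `NOT` node names are illustrative assumptions, not the proposed SedonaDB implementation, which would wrap a DataFusion expression on the Rust side. Only a few of the listed operators are shown.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True, eq=False, repr=False)
class Expr:
    """Toy expression node (hypothetical); stands in for a wrapped engine expression."""

    op: str
    args: tuple[Any, ...] = ()

    def _binary(self, op: str, other: Any) -> Expr:
        # Plain Python values are wrapped as literal nodes.
        rhs = other if isinstance(other, Expr) else Expr("lit", (other,))
        return Expr(op, (self, rhs))

    # Each operator returns a new node; nothing is evaluated here.
    def __add__(self, other: Any) -> Expr:
        return self._binary("+", other)

    def __gt__(self, other: Any) -> Expr:
        return self._binary(">", other)

    def __eq__(self, other: Any) -> Expr:  # type: ignore[override]
        return self._binary("==", other)

    # `&` and `~` stand in for AND / NOT because `and` / `not` cannot be overloaded.
    def __and__(self, other: Any) -> Expr:
        return self._binary("AND", other)

    def __invert__(self) -> Expr:
        return Expr("NOT", (self,))

    def __repr__(self) -> str:
        if self.op == "col":
            return str(self.args[0])
        if self.op == "lit":
            return repr(self.args[0])
        if self.op == "NOT":
            return f"(NOT {self.args[0]!r})"
        left, right = self.args
        return f"({left!r} {self.op} {right!r})"


def col(name: str) -> Expr:
    """Reference a column by name."""
    return Expr("col", (name,))


predicate = (col("population") > 1000) & ~(col("kind") == "park")
print(predicate)  # ((population > 1000) AND (NOT (kind == 'park')))
```

   The parentheses around each comparison are required: `&` and `~` bind more tightly than `>` and `==` in Python.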
   
   Curated `sd.st` v1 list (~40 functions):
   
   - **Constructors**: `point`, `geom_from_wkt`, `geom_from_wkb`, 
`geog_from_wkt`, `geog_from_wkb`, `make_line`.
   - **Accessors**: `area`, `length`, `perimeter`, `centroid`, `envelope`, 
`npoints`, `x`, `y`, `z`, `m`, `srid`, `is_valid`, `has_z`.
   - **Predicates**: `intersects`, `contains`, `within`, `covers`, 
`covered_by`, `touches`, `crosses`, `overlaps`, `dwithin`, `equals`.
   - **Constructive**: `buffer`, `union`, `intersection`, `difference`, 
`convex_hull`, `concave_hull`, `simplify`.
   - **Transforms**: `transform`, `set_crs`, `flip_coordinates`.
   - **Output**: `as_text`, `as_binary`, `as_geojson`.
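   
   The relationship between the curated wrappers and the `call` escape hatch from decision 3 could look roughly like the sketch below. It is a pure-Python toy under stated assumptions: `FunctionCall` is a hypothetical stand-in for an expression node, and the mapping of `buffer` / `intersects` to `ST_Buffer` / `ST_Intersects` is assumed from the usual SQL naming rather than confirmed here.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class FunctionCall:
    """Toy stand-in (hypothetical) for an expression that calls a function by name."""

    name: str
    args: Tuple[Any, ...]


def call(name: str, *args: Any) -> FunctionCall:
    """Escape hatch: build a call to any registered function by its SQL name."""
    return FunctionCall(name, args)


# Curated wrappers only add a Pythonic name and signature on top of `call`.
def buffer(geom: Any, distance: float) -> FunctionCall:
    return call("ST_Buffer", geom, distance)


def intersects(left: Any, right: Any) -> FunctionCall:
    return call("ST_Intersects", left, right)


print(buffer("geom", 10.0))
print(call("ST_Whatever", "geom"))
```

   With this shape, a function missing from the curated list costs the user one `call(...)` rather than a round trip through a SQL string.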
   
   ## Milestones
   
   - **M1** — Expression layer (`Expr`, `col`, operators). Core ops: `select`, 
`filter`, `with_column`, `drop`, `rename`, `sort_values`. Pandas `__getitem__` 
/ `__setitem__`. Tests.
   - **M2** — `join`/`merge`, `group_by().agg()`, `distinct`, `union`. Tests.
   - **M3** — `sd.st` curated module. Tests + doc examples.
   - **M4** — GeoPandas facade: `.geometry`, `.crs`, `.set_geometry`, 
`.to_crs`, `.sjoin`. Migration recipes doc.
   - **M5** — Polish: `NotImplementedError` stubs for unsupported pandas 
surface, `dtypes`, release notes entry, migration guide.
   
   Each milestone is additive to the existing Python package — no feature flag 
needed because the SQL surface is untouched.
   
   ## Implementation notes
   
   ### Rust side (PyO3)
   
   Add to `python/sedonadb/src/`:
   
   - `expr.rs` — `InternalExpr` holding `datafusion_expr::Expr`. Methods: 
`col(name)`, constructors from Python literals, operator wrappers, `alias`, 
`cast`, `is_null`, etc.
   - `functions.rs` — `scalar_udf_call(name, args)` that looks up a UDF in the 
context's registry and returns an `InternalExpr`. This is what the Python 
`sd.st.*` wrappers call into.
   - Extend `InternalDataFrame` with: `select(exprs)`, `filter(expr)`, 
`with_column(name, expr)`, `drop(cols)`, `rename(map)`, `sort(exprs)`, 
`distinct()`, `union(other, distinct)`, `join(other, left_on, right_on, how)`, 
`aggregate(group_exprs, agg_exprs)`.
   
   Each new method is a thin wrapper over the corresponding DataFusion 
`DataFrame` method — no new query-engine code.
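   
   On the Python side, the "thin wrapper" idea might be sketched as follows. This is a toy, not the actual binding: `InternalFrame` is a hypothetical stand-in for the PyO3 handle that only records plan steps, and predicates are plain strings to keep the example self-contained.

```python
from typing import Tuple


class InternalFrame:
    """Toy stand-in (hypothetical) for the native handle; it only records plan steps."""

    def __init__(self, steps: Tuple[str, ...] = ("scan",)) -> None:
        self.steps = steps

    def filter(self, predicate: str) -> "InternalFrame":
        return InternalFrame(self.steps + (f"filter[{predicate}]",))

    def select(self, *columns: str) -> "InternalFrame":
        return InternalFrame(self.steps + (f"select[{', '.join(columns)}]",))


class DataFrame:
    """Python-facing wrapper: each method delegates, then wraps the new handle."""

    def __init__(self, impl: InternalFrame) -> None:
        self._impl = impl

    def filter(self, predicate: str) -> "DataFrame":
        return DataFrame(self._impl.filter(predicate))

    where = filter  # alias for the same operation

    def select(self, *columns: str) -> "DataFrame":
        return DataFrame(self._impl.select(*columns))

    def explain(self) -> str:
        return " -> ".join(self._impl.steps)


base = DataFrame(InternalFrame())
result = base.filter("area > 10").select("name", "geometry")
print(result.explain())  # scan -> filter[area > 10] -> select[name, geometry]
```

   Because every call returns a new wrapper, `base` is unchanged after the chain, which is what makes the frames safe to reuse.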
   
   ### `__setitem__` semantics
   
   `df["x"] = expr` is ambiguous in an immutable model. We choose Polars' 
approach: it is a shortcut for `df = df.with_column("x", expr)` and we document 
clearly that it rebinds the local name and does not mutate any shared state.
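   
   One way the sugar could be realized is sketched below, as an assumption about the implementation rather than a settled design. Python discards the return value of `__setitem__` and gives it no way to rebind the caller's variable, so this toy `Frame` swaps its own reference to an immutable tuple of columns. Frames derived earlier keep their own state, but a second name bound to the same wrapper object would see the new column.

```python
class Frame:
    """Toy frame (hypothetical) whose state is an immutable tuple of (name, expr) pairs."""

    def __init__(self, columns=()):
        self._columns = tuple(columns)

    def with_column(self, name, expr):
        # Pure operation: returns a new frame and leaves `self` untouched.
        kept = tuple(c for c in self._columns if c[0] != name)
        return Frame(kept + ((name, expr),))

    def __setitem__(self, name, expr):
        # Sugar: reuse with_column, then point this wrapper at the new state.
        self._columns = self.with_column(name, expr)._columns

    @property
    def columns(self):
        return [name for name, _ in self._columns]


df = Frame([("id", "id")])
snapshot = df.with_column("area", "ST_Area(geom)")  # derived frame, independent of df
df["length"] = "ST_Length(geom)"

print(df.columns)
print(snapshot.columns)
```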
   
   ### Active geometry resolution
   
   Already exposed: `primary_geometry_column` on `InternalDataFrame`. Python 
`.geometry`, `.crs`, `.to_crs`, `.sjoin` all resolve through that. 
`set_geometry(col)` produces a new DataFrame whose schema metadata marks `col` 
as the primary geometry.
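   
   The resolution rule can be illustrated with a small toy model. Everything here is hypothetical: `GeoSchema` is not a SedonaDB type, and the fallback to the first geometry column when none is marked is an assumption made for the example. The point is only that `.crs` resolves through the primary column and that `set_geometry` returns a new object.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class GeoSchema:
    """Toy schema (hypothetical): geometry column name -> CRS, plus a primary marker."""

    geometry_crs: Dict[str, Optional[str]]
    primary: Optional[str] = None

    @property
    def primary_geometry_column(self) -> Optional[str]:
        # Assumed fallback: the first geometry column when none is marked.
        if self.primary is not None:
            return self.primary
        return next(iter(self.geometry_crs), None)

    @property
    def crs(self) -> Optional[str]:
        name = self.primary_geometry_column
        return None if name is None else self.geometry_crs[name]

    def set_geometry(self, col: str) -> "GeoSchema":
        if col not in self.geometry_crs:
            raise KeyError(f"{col!r} is not a geometry column")
        # Returns a new schema; the original keeps its primary column.
        return replace(self, primary=col)


schema = GeoSchema({"geom": "EPSG:4326", "centroid": "EPSG:3857"})
moved = schema.set_geometry("centroid")
print(schema.crs, moved.crs)
```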
   
   ### Error messages
   
   When a user calls a pandas-only method or accessor that we explicitly do 
not support (`.loc`, `.iloc`, `.reset_index`, `.apply`, `.set_index`), raise 
`NotImplementedError` with a one-line pointer to the migration guide section.
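   
   A minimal sketch of such stubs, assuming a small helper that generates them. The helper name `_unsupported`, the hint wording, and the `MIGRATION_GUIDE` placeholder are all illustrative; the real guide location is not specified in this issue.

```python
MIGRATION_GUIDE = "the migration guide"  # placeholder, not a real location


def _unsupported(name: str, hint: str):
    """Build a stub (hypothetical helper) that raises with a one-line pointer."""

    def _raise(self, *args, **kwargs):
        raise NotImplementedError(
            f"DataFrame.{name} is not supported; {hint} See {MIGRATION_GUIDE}."
        )

    return _raise


class DataFrame:
    # .loc and .iloc are attributes in pandas, so they are properties here
    # and raise on access rather than on call.
    loc = property(_unsupported("loc", "use filter() and select() instead."))
    iloc = property(_unsupported("iloc", "use limit() or head() instead."))
    # Methods raise when called.
    apply = _unsupported("apply", "use expressions with with_column() instead.")
    reset_index = _unsupported("reset_index", "frames have no row index.")


try:
    DataFrame().loc
except NotImplementedError as exc:
    print(exc)
```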
   
   ## Risks & open questions
   
   - **Lazy evaluation surprises**. Users coming from pandas expect every 
operation to materialize its result; our frames stay lazy. Mitigated by keeping 
`.show()` / `.to_pandas()` ergonomic and pointing to them in error messages and 
docs.
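   - **`__eq__` overload collides with dict / set usage**. Once `==` returns 
an `Expr` instead of a `bool`, expressions no longer behave as expected as dict 
keys, set members, or in `in` checks. This is the standard trade-off in 
DataFrame libraries (Polars, PySpark, Ibis). Document it; provide an escape 
hatch if needed.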
   - **`set_geometry` via field metadata** needs to round-trip through 
DataFusion's plan layer. If DataFusion strips custom metadata on `select *`, 
we'll need a small shim. Worth a spike before M4.
   - **`__setitem__` semantics** — revisit if user feedback on M1 shows the 
"returns a new frame but looks mutating" contract is too surprising.
   
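   The `__eq__` risk listed above can be reproduced in a few lines of plain Python. The sketch is a toy: the `Expr` class, the choice to raise from `__bool__`, and the `.equals()` escape hatch are assumptions for illustration, not decisions made in this issue.

```python
class Expr:
    """Toy expression (hypothetical) whose == builds a comparison node, not a bool."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __eq__(self, other):  # returns an Expr, not a bool
        rhs = other.text if isinstance(other, Expr) else repr(other)
        return Expr(f"({self.text} = {rhs})")

    # Python already drops the default hash when __eq__ is overridden; stating
    # it explicitly documents that dict keys and set members are rejected.
    __hash__ = None

    def __bool__(self) -> bool:
        # Fail loudly on `if a == b:` or `expr in some_list`.
        raise TypeError("Expr has no truth value; use .equals() to compare expressions")

    def equals(self, other: "Expr") -> bool:
        """Possible escape hatch: structural comparison that returns a bool."""
        return isinstance(other, Expr) and self.text == other.text


print((Expr("a") == 1).text)
print(Expr("a").equals(Expr("a")))
```

   Raising from `__bool__` trades a silent wrong answer for an early error, which is usually easier to document than truthy expression objects.
   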
   ## Non-goals (explicit)
   
   Documented so future contributors don't re-propose these:
   
   - Row labels, `Index`, `RangeIndex`, `MultiIndex`.
   - `.loc`, `.iloc`, `.at`, `.iat`.
   - `.apply`, `.applymap`, `.map` over rows.
   - `GeoSeries` as a type in v1.
   - `GeoDataFrame` as a subclass.
   - `.plot()` — users go through `to_pandas()`.
   - Mutable in-place operations (`df.sort_values(..., inplace=True)`).
   

