jiayuasu opened a new issue, #2877: URL: https://github.com/apache/sedona/issues/2877
## Background

Sedona has no first-class bounding-box value type. `ST_Envelope` returns a polygon `Geometry`, and users reconstruct bboxes via `ST_XMin` / `ST_XMax` / `ST_YMin` / `ST_YMax`. This is awkward for common operations: bbox-from-geometry, dataset extent, GeoParquet covering columns, partition pruning.

Sister project [`apache/sedona-db`](https://github.com/apache/sedona-db) has an internal `BoundingBox` (`rust/sedona-geometry/src/bounding_box.rs`) but doesn't expose it as a SQL type. PostGIS has `box2d` / `box3d`. GeoParquet 1.1 standardizes a `struct<xmin, ymin, xmax, ymax>` bbox covering column, which Sedona already reads/writes as a raw struct (`GeoParquetMetaData.scala`, `GeoParquetSpatialFilter.scala`).

## Plan

Add `Box2D` as a native value type. Phase 1 covers the Spark/JVM side, Python and Flink mirrors, and GeoParquet writer integration. `Box3D` and geography bboxes are out of scope and tracked as follow-ups.

### Type

`Box2DUDT` is a struct-backed UDT with `sqlType = struct<xmin: double, ymin: double, xmax: double, ymax: double>` (all non-nullable).

It is struct-backed (not binary-backed) so values round-trip natively to Parquet and align zero-copy with GeoParquet 1.1 bbox covering columns. Field names match the GeoParquet 1.1 spec and `sedona-db`'s GeoParquet writer.

Empty boxes are encoded as `xmin > xmax` (JTS `Envelope` convention). Empty acts as the identity for union/expand.

Split `Box2D` / `Box3D` rather than a unified type with optional Z. Reasons:

1. GeoParquet 1.1 covering columns are 2D-only. A dedicated `Box2D` matches the spec bit-for-bit.
2. Storage: 32 bytes/row vs. ~56 bytes for a unified type with nullable Z. This is a material cost on `ST_Extent` shuffles.
3. Static dispatch for dimension-specific functions (`ST_Area(box2d)` vs `ST_Volume(box3d)`).
4. PostGIS familiarity.

`Box3D` is deferred until a concrete need (point clouds, BIM, voxel data) lands.

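To make the empty-box convention concrete, here is a minimal Python sketch of the semantics described under "Type". It is not Sedona code: the `Box2D` class, its method names, and the `extent` helper are hypothetical stand-ins, and the specific empty values `(0, 0, -1, -1)` are an assumption borrowed from JTS's null `Envelope`. The proposal only requires `xmin > xmax`.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Box2D:
    """Illustrative stand-in for the proposed type; not Sedona's API."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    @staticmethod
    def empty() -> "Box2D":
        # Any xmin > xmax marks an empty box. The concrete values mirror
        # JTS's null Envelope and are an assumption of this sketch.
        return Box2D(0.0, 0.0, -1.0, -1.0)

    def is_empty(self) -> bool:
        return self.xmin > self.xmax

    def union(self, other: "Box2D") -> "Box2D":
        # Empty is the identity: merging with it returns the other operand.
        if self.is_empty():
            return other
        if other.is_empty():
            return self
        return Box2D(
            min(self.xmin, other.xmin),
            min(self.ymin, other.ymin),
            max(self.xmax, other.xmax),
            max(self.ymax, other.ymax),
        )


def extent(boxes):
    """Fold boxes into one extent, starting from the empty identity."""
    return reduce(Box2D.union, boxes, Box2D.empty())
```

Because empty is the identity, an aggregate in the style of `ST_Extent` can start every partial result from `Box2D.empty()` and merge partials in any order, and an input with no rows naturally yields an empty box.
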
### SQL surface (Phase 1)

| Function | Signature |
|---|---|
| `ST_Box2D(geom)` | `Geometry → Box2D` |
| `ST_MakeBox2D(point, point)` | `(Point, Point) → Box2D` |
| `ST_Extent(geom)` | aggregate `Geometry → Box2D` |
| `ST_XMin` / `ST_XMax` / `ST_YMin` / `ST_YMax(box2d)` | `Box2D → Double` (overload existing accessors) |
| `CAST(box2d AS geometry)` | `Box2D → Polygon` |
| `ST_AsText(box2d)` | `Box2D → 'BOX(x1 y1, x2 y2)'` |

`ST_Envelope` keeps returning a polygon `Geometry` (no break). `ST_Envelope_Aggr` is left untouched.

### GeoParquet writer

When emitting a `Box2D` column, write it as a native GeoParquet 1.1 bbox covering column. Values are Float32, with `Math.nextUp` / `Math.nextDown` for conservative outward rounding so Float32 bounds always contain the Float64 truth (bit-compatible with `sedona-db`'s `next_after` approach in `rust/sedona-geoparquet/src/writer.rs`).

### Cross-language

Python and Flink mirror the Phase 1 SQL surface in the same release. R is deferred.

## Out of scope (follow-ups)

- `ST_Expand(box, dx, dy)`
- Box predicates (`ST_BoxIntersects`, `ST_BoxContains`)
- Implicit `geometry → box2d` cast
- `Box3D`, `ST_3DExtent`
- Reader-side auto-materialization of GeoParquet bbox covering columns as `Box2D`. This would give typed bbox columns from existing files with no migration, but the reader path has more edge cases (legacy files, missing metadata, conflicting schemas) and is worth its own change.
- Geography bboxes. `Geography` doesn't have a bbox type today; PostGIS doesn't expose one either. The likely path is reusing `Box2D` with antimeridian wraparound semantics (matching `sedona-db`'s `WraparoundInterval`), which conflicts with the empty marker and needs its own design.

## Coordination with sedona-db

`sedona-db`'s GeoParquet writer uses `xmin/ymin/xmax/ymax` (Float32), but its `st_analyze_agg` returns `minx/miny/maxx/maxy` (Float64). Worth aligning on the Parquet-spec naming as part of this work.

## Implementation files (Phase 1 estimate)

- `common/src/main/java/org/apache/sedona/common/geometryObjects/Box2D.java` (new)
- `spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/UDT/Box2DUDT.scala` (new)
- `spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/UDT/UdtRegistratorWrapper.scala` (register)
- `spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/Functions.scala` (`ST_Box2D`, accessor overloads)
- `spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/Constructors.scala` (`ST_MakeBox2D`)
- `spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/AggregateFunctions.scala` (`ST_Extent`)
- `spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/InferredExpression.scala` (Box2D type mapping)
- `spark/common/src/main/scala/org/apache/sedona/sql/UDF/Catalog.scala` (register expressions)
- `spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/geoparquet/GeoParquetWriteSupport.scala` (covering column writer)
- `python/sedona/spark/sql/{st_functions,st_aggregates,types}.py`
- `flink/src/main/java/org/apache/sedona/flink/expressions/{Catalog,Aggregators}.java`
- Test suites in `spark/common/src/test`, `python/tests`, `flink/src/test`.
