[I] Add a native Box2D type for bounding boxes [sedona]

via GitHub Thu, 30 Apr 2026 22:58:52 -0700


jiayuasu opened a new issue, #2877:
URL: https://github.com/apache/sedona/issues/2877


   ## Background
   
   Sedona has no first-class bounding-box value type. `ST_Envelope` returns a 
polygon `Geometry`, and users reconstruct bboxes via `ST_XMin` / `ST_XMax` / 
`ST_YMin` / `ST_YMax`. This is awkward for common operations: 
bbox-from-geometry, dataset extent, GeoParquet covering columns, and partition 
pruning.
   
   Sister project [`apache/sedona-db`](https://github.com/apache/sedona-db) has 
an internal `BoundingBox` (`rust/sedona-geometry/src/bounding_box.rs`) but 
doesn't expose it as a SQL type. PostGIS has `box2d` / `box3d`. GeoParquet 1.1 
standardizes a `struct<xmin, ymin, xmax, ymax>` bbox covering column, which 
Sedona already reads/writes as a raw struct (`GeoParquetMetaData.scala`, 
`GeoParquetSpatialFilter.scala`).
   
   ## Plan
   
   Add `Box2D` as a native value type. Phase 1 covers the Spark/JVM side, 
Python and Flink mirrors, and GeoParquet writer integration. `Box3D` and 
geography bboxes are out of scope and tracked as follow-ups.
   
   ### Type
   
   `Box2DUDT` is a struct-backed UDT with `sqlType = struct<xmin: double, ymin: 
double, xmax: double, ymax: double>` (all non-nullable). Struct-backed (not 
binary-backed) so values round-trip natively to Parquet and align zero-copy 
with GeoParquet 1.1 bbox covering columns.
   
   Field names match the GeoParquet 1.1 spec and `sedona-db`'s GeoParquet 
writer.
   
   Empty boxes are encoded as `xmin > xmax` (JTS `Envelope` convention). An 
empty box acts as the identity for union/expand.
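
The empty-box rule can be illustrated with a small Python sketch. This is not Sedona code: the `Box2D` class, its method names, and the use of infinities as the empty sentinel are assumptions made for illustration. Only the `xmin > xmax` test and the identity behaviour come from the proposal.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Box2D:
    """Hypothetical stand-in for the proposed type; not Sedona's API."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    @staticmethod
    def empty() -> "Box2D":
        # Any xmin > xmax marks the box as empty; infinities are one
        # convenient (assumed) choice of sentinel values.
        return Box2D(math.inf, math.inf, -math.inf, -math.inf)

    def is_empty(self) -> bool:
        return self.xmin > self.xmax

    def union(self, other: "Box2D") -> "Box2D":
        # Skip empty operands explicitly so the result does not depend on
        # which sentinel values were used to encode "empty".
        if self.is_empty():
            return other
        if other.is_empty():
            return self
        return Box2D(
            min(self.xmin, other.xmin),
            min(self.ymin, other.ymin),
            max(self.xmax, other.xmax),
            max(self.ymax, other.ymax),
        )


def extent(boxes):
    """Fold boxes into one extent, starting from the empty identity."""
    result = Box2D.empty()
    for box in boxes:
        result = result.union(box)
    return result
```

Starting the fold from the empty box is what lets an `ST_Extent`-style aggregate return a well-defined (empty) result for zero input rows.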
   
   Split `Box2D` / `Box3D` rather than a unified type with optional Z. Reasons:
   
   1. GeoParquet 1.1 covering columns are 2D-only. A dedicated `Box2D` matches 
the spec bit-for-bit.
   2. Storage: 32 bytes/row vs. ~56 bytes for a unified type with nullable Z. 
Material cost on `ST_Extent` shuffles.
   3. Static dispatch for dimension-specific functions (`ST_Area(box2d)` vs 
`ST_Volume(box3d)`).
   4. PostGIS familiarity.
   
   `Box3D` is deferred until a concrete need (point clouds, BIM, voxel data) 
lands.
   
   ### SQL surface (Phase 1)
   
   | Function | Signature |
   |---|---|
   | `ST_Box2D(geom)` | `Geometry → Box2D` |
   | `ST_MakeBox2D(point, point)` | `(Point, Point) → Box2D` |
   | `ST_Extent(geom)` | aggregate `Geometry → Box2D` |
   | `ST_XMin` / `ST_XMax` / `ST_YMin` / `ST_YMax(box2d)` | `Box2D → Double` (overload existing accessors) |
   | `CAST(box2d AS geometry)` | `Box2D → Polygon` |
   | `ST_AsText(box2d)` | `Box2D → 'BOX(x1 y1, x2 y2)'` |
   
   `ST_Envelope` keeps returning a polygon `Geometry` (no break). 
`ST_Envelope_Aggr` is left untouched.
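
As a rough illustration of the two output forms in the table, the Python sketch below renders a box as `BOX(x1 y1, x2 y2)` text and as a closed polygon ring. The function names, the number formatting, and the ring's corner order are assumptions, since the proposal does not specify them. Empty boxes are not handled here.

```python
def box_to_text(xmin, ymin, xmax, ymax):
    """Render the 'BOX(x1 y1, x2 y2)' form listed for ST_AsText(box2d)."""
    # Number formatting via repr() is an assumption; the proposal does not fix it.
    return f"BOX({xmin!r} {ymin!r}, {xmax!r} {ymax!r})"


def box_to_polygon_ring(xmin, ymin, xmax, ymax):
    """Closed corner ring for the Box2D -> Polygon cast (ring order is assumed)."""
    return [
        (xmin, ymin),
        (xmin, ymax),
        (xmax, ymax),
        (xmax, ymin),
        (xmin, ymin),  # repeat the first corner to close the ring
    ]
```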
   
   ### GeoParquet writer
   
   When emitting a `Box2D` column, write it as a native GeoParquet 1.1 bbox 
covering column. Values are stored as Float32 and rounded conservatively 
outward with `Math.nextUp` / `Math.nextDown`, so the Float32 bounds always 
contain the Float64 truth (bit-compatible with `sedona-db`'s `next_after` 
approach in `rust/sedona-geoparquet/src/writer.rs`).
   
   ### Cross-language
   
   Python and Flink mirror the Phase 1 SQL surface in the same release. R 
deferred.
   
   ## Out of scope (follow-ups)
   
   - `ST_Expand(box, dx, dy)`
   - Box predicates (`ST_BoxIntersects`, `ST_BoxContains`)
   - Implicit `geometry → box2d` cast
   - `Box3D`, `ST_3DExtent`
   - Reader-side auto-materialization of GeoParquet bbox covering columns as 
`Box2D` (typed bbox columns from existing files with no migration; the reader 
path has more edge cases — legacy files, missing metadata, conflicting schemas 
— and is worth its own change)
   - Geography bboxes. `Geography` doesn't have a bbox type today; PostGIS 
doesn't expose one either. The likely path is reusing `Box2D` with antimeridian 
wraparound semantics (matching `sedona-db`'s `WraparoundInterval`), which 
conflicts with the empty marker and needs its own design.
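
The conflict in the last bullet can be made concrete with a short Python sketch. The wraparound rule shown is an assumed reading of how an antimeridian-crossing interval would work, not a description of `sedona-db`'s code, and both function names are hypothetical.

```python
def is_empty_planar(xmin: float, xmax: float) -> bool:
    # Phase 1 rule from the proposal: xmin > xmax means "no box".
    return xmin > xmax


def wraparound_contains_x(xmin: float, xmax: float, x: float) -> bool:
    # Assumed geography rule: xmin > xmax means the interval crosses the
    # antimeridian and covers [xmin, 180] plus [-180, xmax].
    if xmin <= xmax:
        return xmin <= x <= xmax
    return x >= xmin or x <= xmax


# A box spanning 170 degrees east to 170 degrees west across the antimeridian:
crossing = (170.0, -170.0)
reads_as_empty = is_empty_planar(*crossing)
holds_175 = wraparound_contains_x(*crossing, 175.0)
```

The same pair of values reads as empty under the planar rule and as a valid 20-degree-wide interval under the wraparound rule, which is why geography bboxes would need a different empty marker.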
   
   ## Coordination with sedona-db
   
   `sedona-db`'s GeoParquet writer uses `xmin/ymin/xmax/ymax` (Float32), but 
its `st_analyze_agg` returns `minx/miny/maxx/maxy` (Float64). Worth aligning on 
the Parquet-spec naming as part of this work.
   
   ## Implementation files (Phase 1 estimate)
   
   - `common/src/main/java/org/apache/sedona/common/geometryObjects/Box2D.java` (new)
   - `spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/UDT/Box2DUDT.scala` (new)
   - `spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/UDT/UdtRegistratorWrapper.scala` (register)
   - `spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/Functions.scala` (`ST_Box2D`, accessor overloads)
   - `spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/Constructors.scala` (`ST_MakeBox2D`)
   - `spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/AggregateFunctions.scala` (`ST_Extent`)
   - `spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/InferredExpression.scala` (Box2D type mapping)
   - `spark/common/src/main/scala/org/apache/sedona/sql/UDF/Catalog.scala` (register expressions)
   - `spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/geoparquet/GeoParquetWriteSupport.scala` (covering column writer)
   - `python/sedona/spark/sql/{st_functions,st_aggregates,types}.py`
   - `flink/src/main/java/org/apache/sedona/flink/expressions/{Catalog,Aggregators}.java`
   - Test suites in `spark/common/src/test`, `python/tests`, `flink/src/test`.


