james-willis opened a new pull request, #2903: URL: https://github.com/apache/sedona/pull/2903
## Summary Fixes #2880. When a Spark partition has zero rows, Sedona's GeoParquet writer was emitting `bbox: [0, 0, 0, 0]` in the per-column `geo` metadata. Per the [GeoParquet 1.1 spec](https://geoparquet.org/releases/v1.1.0/), `bbox` is the bounding box of the geometries in the file and is **optional** (the spec says "if specified, MUST be encoded with an array..."). For a file with no geometries we should omit it, not fabricate a phantom extent at Null Island. The fabricated `[0, 0, 0, 0]` is actively harmful: - **Breaks bbox-based file pruning.** Sedona's own `GeoParquetSpatialFilter` (and any reader doing the same) treats the empty file as overlapping any AOI containing (0, 0) — including most lat/lon AOIs that straddle the equator/prime meridian. - **Corrupts dataset extent reporting.** Tools that aggregate per-file bboxes (or GDAL's `GetExtent()` on a single file) report (0, 0) as the layer extent. - **Spec non-compliance.** It asserts an extent that does not exist in the file. ## Commits 1. **Writer:** Make `GeometryFieldMetaData.bbox` an `Option[Seq[Double]]` and emit `None` (which json4s omits from the JSON) when no geometries were observed. The accumulator at `GeometryColumnBoundingBox` was already correctly initialized to `(+Inf, +Inf, -Inf, -Inf)`; only the fallback at `GeoParquetWriteSupport.scala:248-254` needed fixing. 2. **Reader:** In `GeoParquetSpatialFilter.LeafFilter.evaluate`, also treat the bbox as untrusted when `geometry_types` is empty. This is a defensive guard so already-written legacy files (with `bbox: [0, 0, 0, 0]` and `geometry_types: []`) stop poisoning Sedona's own spatial pruning. Per the GeoParquet spec, an empty `geometry_types` array signals "geometry types are not known," and a writer that didn't observe any geometries cannot have observed a real bbox either. ## Impact on other GeoParquet readers I researched how other major readers handle a missing/null `bbox` vs. `[0, 0, 0, 0]`. **Omitting the field is strictly an improvement** in every case: | Reader | Uses column-metadata `bbox` for pruning? | Tolerates missing `bbox`? | Effect of `[0,0,0,0]` today | |---|---|---|---| | **DuckDB Spatial** (`ST_Read` / `read_parquet`) | No — pruning uses Parquet column stats on a separate `bbox` struct column ([discussion #484](https://github.com/duckdb/duckdb-spatial/discussions/484)) | Yes | No effect on query results | | **GeoPandas** (`read_parquet` in `geopandas/io/arrow.py`) | No — `bbox=` filter routes through `covering.bbox` struct column or a `point` encoding | Yes; in fact GeoPandas itself already omits bbox for all-NA columns (`if np.isfinite(bbox).all(): ...`) | No effect on query results | | **GDAL/OGR Parquet driver** | No — pruning uses `bbox` struct column / `covering.bbox` ([driver docs](https://gdal.org/en/stable/drivers/vector/parquet.html)) | Yes | Corrupts `GetExtent()` reporting (returns origin) | | **pyarrow** | N/A — surfaces the JSON, doesn't interpret it | Yes | None | | **stac-geoparquet** aggregator | Not yet (TODO in `_to_parquet.py`) | N/A | Would corrupt collection bbox unions once implemented | | **Sedona** (this PR) | Yes — `GeoParquetSpatialFilter.LeafFilter` | Yes (after this PR) | False non-prunes for AOIs near origin | GeoPandas' existing behavior is precedent — it already omits `bbox` for empty/all-NA inputs. ## Test plan - [x] Updated `geoparquetIOTests` "GeoParquet save should work with empty dataframes": now asserts the `bbox` field is omitted from the JSON (was previously asserting `Seq(0.0, 0.0, 0.0, 0.0)`). - [x] New unit test in `GeoParquetSpatialFilterPushDownSuite` covering `LeafFilter.evaluate` for (a) legacy buggy metadata `bbox=[0,0,0,0]` + `geometry_types=[]` (must not prune), (b) spec-compliant missing `bbox` (must not prune), (c) real bbox far from query window (must prune — sanity). - [ ] CI to run the rest of `geoparquet*` and `Stac*` suites (touched `StacBatch.scala` and the v2 metadata partition reader factory across all four spark-{3.4,3.5,4.0,4.1} variants). ## Spec reference GeoParquet 1.1 (column metadata for the geometry column): > **bbox**: Bounding Box of the geometries in the file, formatted according to RFC 7946, section 5. > > The bbox, **if specified**, MUST be encoded with an array... — https://geoparquet.org/releases/v1.1.0/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
