james-willis opened a new pull request, #2903:
URL: https://github.com/apache/sedona/pull/2903

   ## Summary
   
   Fixes #2880.
   
   When a Spark partition has zero rows, Sedona's GeoParquet writer was 
emitting `bbox: [0, 0, 0, 0]` in the per-column `geo` metadata. Per the 
[GeoParquet 1.1 spec](https://geoparquet.org/releases/v1.1.0/), `bbox` is the 
bounding box of the geometries in the file and is **optional** (the spec says 
"if specified, MUST be encoded with an array..."). For a file with no 
geometries we should omit it, not fabricate a phantom extent at Null Island.
   
   The fabricated `[0, 0, 0, 0]` is actively harmful:
   - **Breaks bbox-based file pruning.** Sedona's own `GeoParquetSpatialFilter` 
(and any reader doing the same) treats the empty file as overlapping any AOI 
containing (0, 0) — including most lat/lon AOIs that straddle the equator/prime 
meridian.
   - **Corrupts dataset extent reporting.** Tools that aggregate per-file 
bboxes (or GDAL's `GetExtent()` on a single file) report (0, 0) as the layer 
extent.
   - **Spec non-compliance.** It asserts an extent that does not exist in the 
file.
   
   ## Commits
   
   1. **Writer:** Make `GeometryFieldMetaData.bbox` an `Option[Seq[Double]]` 
and emit `None` (which json4s omits from the JSON) when no geometries were 
observed. The accumulator at `GeometryColumnBoundingBox` was already correctly 
initialized to `(+Inf, +Inf, -Inf, -Inf)`; only the fallback at 
`GeoParquetWriteSupport.scala:248-254` needed fixing.
   2. **Reader:** In `GeoParquetSpatialFilter.LeafFilter.evaluate`, also treat 
the bbox as untrusted when `geometry_types` is empty. This is a defensive guard 
so already-written legacy files (with `bbox: [0, 0, 0, 0]` and `geometry_types: 
[]`) stop poisoning Sedona's own spatial pruning. Per the GeoParquet spec, an 
empty `geometry_types` array signals "geometry types are not known," and a 
writer that didn't observe any geometries cannot have observed a real bbox 
either.
   
   ## Impact on other GeoParquet readers
   
   I researched how other major readers handle a missing/null `bbox` vs. `[0, 
0, 0, 0]`. **Omitting the field is strictly an improvement** in every case:
   
   | Reader | Uses column-metadata `bbox` for pruning? | Tolerates missing 
`bbox`? | Effect of `[0,0,0,0]` today |
   |---|---|---|---|
   | **DuckDB Spatial** (`ST_Read` / `read_parquet`) | No — pruning uses 
Parquet column stats on a separate `bbox` struct column ([discussion 
#484](https://github.com/duckdb/duckdb-spatial/discussions/484)) | Yes | No 
effect on query results |
   | **GeoPandas** (`read_parquet` in `geopandas/io/arrow.py`) | No — `bbox=` 
filter routes through `covering.bbox` struct column or a `point` encoding | 
Yes; in fact GeoPandas itself already omits bbox for all-NA columns (`if 
np.isfinite(bbox).all(): ...`) | No effect on query results |
   | **GDAL/OGR Parquet driver** | No — pruning uses `bbox` struct column / 
`covering.bbox` ([driver 
docs](https://gdal.org/en/stable/drivers/vector/parquet.html)) | Yes | Corrupts 
`GetExtent()` reporting (returns origin) |
   | **pyarrow** | N/A — surfaces the JSON, doesn't interpret it | Yes | None |
   | **stac-geoparquet** aggregator | Not yet (TODO in `_to_parquet.py`) | N/A 
| Would corrupt collection bbox unions once implemented |
   | **Sedona** (this PR) | Yes — `GeoParquetSpatialFilter.LeafFilter` | Yes 
(after this PR) | False non-prunes for AOIs near origin |
   
   GeoPandas' existing behavior is precedent — it already omits `bbox` for 
empty/all-NA inputs.
   
   ## Test plan
   
   - [x] Updated `geoparquetIOTests` "GeoParquet save should work with empty 
dataframes": now asserts the `bbox` field is omitted from the JSON (was 
previously asserting `Seq(0.0, 0.0, 0.0, 0.0)`).
   - [x] New unit test in `GeoParquetSpatialFilterPushDownSuite` covering 
`LeafFilter.evaluate` for (a) legacy buggy metadata `bbox=[0,0,0,0]` + 
`geometry_types=[]` (must not prune), (b) spec-compliant missing `bbox` (must 
not prune), (c) real bbox far from query window (must prune — sanity).
   - [ ] CI to run the rest of `geoparquet*` and `Stac*` suites (touched 
`StacBatch.scala` and the v2 metadata partition reader factory across all four 
spark-{3.4,3.5,4.0,4.1} variants).
   
   ## Spec reference
   
   GeoParquet 1.1 (column metadata for the geometry column):
   > **bbox**: Bounding Box of the geometries in the file, formatted according 
to RFC 7946, section 5.
   >
   > The bbox, **if specified**, MUST be encoded with an array...
   
   — https://geoparquet.org/releases/v1.1.0/


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to