jiayuasu opened a new pull request, #2946: URL: https://github.com/apache/sedona/pull/2946
## Did you read the Contributor Guide?

- Yes, I have read the [Contributor Rules](https://sedona.apache.org/latest/community/rule/) and [Contributor Development Guide](https://sedona.apache.org/latest/community/develop/)

## Is this PR related to a ticket?

- Yes, and the PR name follows the format `[GH-XXX] my subject`. Closes #2938

## What changes were proposed in this PR?

Teaches `SpatialFilterPushDownForGeoParquet` to recognize the Box2D predicates from #2926 and translate them into a new `Box2DLeafFilter` that prunes files via the GeoParquet 1.1 bbox metadata.

### How the pushdown works

A `Box2D`-typed column in a GeoParquet file is registered as the **covering column** of a geometry column (the writer in #2886 does this for both auto-detected `<geom>_bbox` columns and explicit `geoparquet.covering.<geom>=<col>` options). At pushdown time, the rule recognizes:

- `ST_BoxIntersects(box_col, lit_box)` (and reverse arg order)
- `ST_BoxContains(box_col, lit_box)` (and reverse arg order)

…and produces a `Box2DLeafFilter(box_col, lit_box)`. At evaluation time the filter walks the file's column metadata, finds the geometry column whose `covering.bbox.xmin[0]` equals the predicate's Box2D column name, and prunes using that geometry column's recorded bbox.

### Why both predicates push down as INTERSECTS at the file level

Per-row containment implies per-row intersection, which implies the file's union envelope must intersect the query box. So `ST_BoxContains(box_col, lit_box)` is sound to push down using the same intersection check as `ST_BoxIntersects`; the per-row containment refinement happens after the file is loaded.

The pushdown is **sound** (never produces false negatives); it may be slightly conservative when per-row Box2D values are wider than the geometries they cover (e.g., user-supplied bboxes rather than auto-derived envelopes), but in the typical case where the Box2D column is `ST_Box2D(geom)` the pruning is near-optimal.
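The file-level check described above can be illustrated with a small standalone sketch. This is not the code in this PR (which lives in Sedona's Scala sources); it is a Python approximation that only mirrors the described steps: find the geometry column whose `covering.bbox.xmin[0]` names the Box2D column, then test that geometry column's recorded bbox against the query box. The names `Box`, `covered_geometry_bbox`, and `keep_file` are hypothetical. The sketch assumes a 2D bbox stored as `[xmin, ymin, xmax, ymax]`, and it keeps the file when no covering link is found, which is an assumption about the conservative fallback rather than something the PR states.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Box:
    """Axis-aligned 2D box (hypothetical stand-in for a Box2D literal)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other: "Box") -> bool:
        # Closed intervals: boxes that only touch still count as intersecting.
        return (self.xmin <= other.xmax and other.xmin <= self.xmax
                and self.ymin <= other.ymax and other.ymin <= self.ymax)


def covered_geometry_bbox(columns: Dict[str, Any], box_col: str) -> Optional[Box]:
    """Return the file-level bbox of the geometry column covered by box_col."""
    for meta in columns.values():
        xmin_path = meta.get("covering", {}).get("bbox", {}).get("xmin", [])
        bbox = meta.get("bbox")
        # Assumption: 2D bbox laid out as [xmin, ymin, xmax, ymax].
        if xmin_path and xmin_path[0] == box_col and bbox and len(bbox) == 4:
            return Box(*bbox)
    return None


def keep_file(columns: Dict[str, Any], box_col: str, query: Box) -> bool:
    """Intersection-only pruning, used for both intersects and contains."""
    file_bbox = covered_geometry_bbox(columns, box_col)
    if file_bbox is None:
        # Assumption: with no usable covering link, never prune.
        return True
    return file_bbox.intersects(query)


# Illustrative "columns" metadata for one file covering the unit quadrant.
q1_columns = {
    "geom": {
        "encoding": "WKB",
        "bbox": [0.0, 0.0, 10.0, 10.0],
        "covering": {"bbox": {
            "xmin": ["geom_bbox", "xmin"], "ymin": ["geom_bbox", "ymin"],
            "xmax": ["geom_bbox", "xmax"], "ymax": ["geom_bbox", "ymax"],
        }},
    }
}
```

With this metadata, `keep_file(q1_columns, "geom_bbox", Box(5, 5, 6, 6))` keeps the file and a fully disjoint query box prunes it. A containment predicate uses the same call, since any per-row refinement would run only on files that survive.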
### Why no `LeafFilter` reuse

The existing `LeafFilter` is keyed by the geometry column name. Our predicate carries the Box2D column name. Resolving Box2D → geometry happens per-file (since the covering link lives in per-file GeoParquet metadata, not the Spark schema), so the new `Box2DLeafFilter` does the lookup at `evaluate` time.

### Pairs naturally with

The deferred GeoParquet reader auto-materialization of bbox covering columns as `Box2D` (in #2877's deferred follow-ups). When that lands, `WHERE ST_BoxIntersects(box_col, lit(b))` becomes the canonical bbox-pruned read path: the typed column comes from disk, the predicate prunes the disk read.

## How was this patch tested?

`GeoParquetSpatialFilterPushDownSuite`:

- "Push down ST_BoxIntersects against a Box2D covering column": Q1-only, left-half-only, and fully-disjoint query windows against a quadrant-partitioned dataset. Verifies (a) the filter is recognized and pushed down, (b) the right files survive evaluation.
- "Push down ST_BoxContains against a Box2D covering column": tiny query box inside Q1, verifies INTERSECTS-style file-level pruning.

The fixture is a copy of the existing `df` with `withColumn("geom_bbox", expr("ST_Box2D(geom)"))` so the writer auto-detects it as the covering column for `geom`.

## What's not in scope

- **Two-sided pushdown** (`ST_BoxIntersects(box_a, box_b)` between two columns): that's the spatial join planner work in #2939.
- **Mixed Box2D / Geometry predicates**: waits on the implicit cast in #2927 or explicit mixed overloads.

## Did this PR include necessary documentation updates?

- No, this PR does not affect any public SQL API documentation surface in isolation. Documentation lands with the consolidated Phase 1+2+3 Box2D docs update.
