jiayuasu opened a new pull request, #2946:
URL: https://github.com/apache/sedona/pull/2946

   ## Did you read the Contributor Guide?
   
   - Yes, I have read the [Contributor Rules](https://sedona.apache.org/latest/community/rule/) and [Contributor Development Guide](https://sedona.apache.org/latest/community/develop/)
   
   ## Is this PR related to a ticket?
   
   - Yes, and the PR name follows the format `[GH-XXX] my subject`. Closes #2938
   
   ## What changes were proposed in this PR?
   
   Teaches `SpatialFilterPushDownForGeoParquet` to recognize the Box2D 
predicates from #2926 and translate them into a new `Box2DLeafFilter` that 
prunes files via the GeoParquet 1.1 bbox metadata.
   
   ### How the pushdown works
   
   A `Box2D`-typed column in a GeoParquet file is registered as the **covering 
column** of a geometry column (the writer in #2886 does this for both 
auto-detected `<geom>_bbox` columns and explicit 
`geoparquet.covering.<geom>=<col>` options). At pushdown time, the rule 
recognizes:
   
   - `ST_BoxIntersects(box_col, lit_box)` (and reverse arg order)
   - `ST_BoxContains(box_col, lit_box)` (and reverse arg order)
   
   …and produces a `Box2DLeafFilter(box_col, lit_box)`. At evaluation time the 
filter walks the file's column metadata, finds the geometry column whose 
`covering.bbox.xmin[0]` equals the predicate's Box2D column name, and prunes 
using that geometry column's recorded bbox.
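
The per-file lookup described above can be sketched in a few lines. This is a minimal, self-contained Python illustration, not the Scala code in this PR. `file_may_match` and `boxes_intersect` are hypothetical names, and the dictionary layout assumes the GeoParquet 1.1 `columns` metadata shape (a 2D `bbox` as `[xmin, ymin, xmax, ymax]`, and `covering.bbox.xmin` as a `[column, field]` path). Keeping files that have no resolvable covering link or no recorded bbox is also an assumption made here.

```python
def boxes_intersect(a, b):
    """True if two closed (xmin, ymin, xmax, ymax) boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def file_may_match(columns_meta, box_col, query_box):
    """Return False only when the file can be skipped.

    columns_meta: the "columns" object of one file's GeoParquet metadata.
    box_col:      name of the Box2D column used in the predicate.
    query_box:    literal (xmin, ymin, xmax, ymax) from the predicate.
    """
    for geom_meta in columns_meta.values():
        xmin_path = geom_meta.get("covering", {}).get("bbox", {}).get("xmin", [])
        # The covering link matches when the first path element is the Box2D column.
        if xmin_path and xmin_path[0] == box_col:
            bbox = geom_meta.get("bbox")
            if bbox is None:
                return True  # assumption: no recorded bbox, so keep the file
            return boxes_intersect(tuple(bbox), query_box)
    return True  # assumption: no covering link found, so keep the file


# Hypothetical metadata for a file whose geometries all fall in one quadrant.
q1_file = {
    "geom": {
        "encoding": "WKB",
        "bbox": [0.0, 0.0, 10.0, 10.0],
        "covering": {
            "bbox": {
                "xmin": ["geom_bbox", "xmin"],
                "ymin": ["geom_bbox", "ymin"],
                "xmax": ["geom_bbox", "xmax"],
                "ymax": ["geom_bbox", "ymax"],
            }
        },
    }
}

keep_overlapping = file_may_match(q1_file, "geom_bbox", (5.0, 5.0, 6.0, 6.0))
keep_disjoint = file_may_match(q1_file, "geom_bbox", (-10.0, -10.0, -5.0, -5.0))
keep_unlinked = file_may_match(q1_file, "other_bbox", (-10.0, -10.0, -5.0, -5.0))
```

In this sketch only the disjoint query against the linked column lets the file be skipped. A predicate on a column with no covering link leaves the file in place.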
   
   ### Why both predicates push down as INTERSECTS at the file level
   
   Per-row containment implies per-row intersection, which in turn implies 
that the file's union envelope intersects the query box. 
`ST_BoxContains(box_col, lit_box)` can therefore be pushed down with the same 
intersection check as `ST_BoxIntersects`; the per-row containment refinement 
happens after the file is loaded. The pushdown is **sound**: it never prunes a 
file that holds a matching row (no false negatives). It may be slightly 
conservative when per-row Box2D values are wider than the geometries they 
cover (e.g., user-supplied bboxes rather than auto-derived envelopes), but in 
the typical case where the Box2D column is `ST_Box2D(geom)` the pruning is 
near-optimal.
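
The implication chain can be checked on a toy example. The sketch below is illustrative Python with hypothetical helper names (`contains`, `intersects`, `union_envelope`). It assumes closed axis-aligned boxes given as `(xmin, ymin, xmax, ymax)` and a file envelope equal to the union of the per-row boxes.

```python
def intersects(a, b):
    """True if two closed boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def contains(outer, inner):
    """True if `inner` lies entirely within `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def union_envelope(boxes):
    """Smallest box covering every box in `boxes`."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


# Hypothetical per-row Box2D values stored in one file.
rows = [(0, 0, 4, 4), (3, 3, 9, 9), (6, 1, 8, 2)]
file_env = union_envelope(rows)

# A query box that one row contains: the file-level intersection check
# must keep the file so the per-row containment test can run afterwards.
query = (4, 4, 5, 5)
matching_rows = [r for r in rows if contains(r, query)]
implication_holds = all(
    intersects(r, query) and intersects(file_env, query) for r in matching_rows
)

# A query box far from every row: no row can contain it, and the
# file-level intersection check prunes the file.
far_query = (20, 20, 21, 21)
far_matches = [r for r in rows if contains(r, far_query)]
file_pruned = not intersects(file_env, far_query)
```

The converse does not hold: a file envelope can intersect the query box without any row containing it, which is why the per-row refinement is still needed after the file is read.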
   
   ### Why no `LeafFilter` reuse
   
   The existing `LeafFilter` is keyed by the geometry column name. Our 
predicate carries the Box2D column name. Resolving Box2D → geometry happens 
per-file (since the covering link lives in per-file GeoParquet metadata, not 
the Spark schema), so the new `Box2DLeafFilter` does the lookup at `evaluate` 
time.
   
   ### Pairs naturally with
   
   This PR pairs naturally with the deferred GeoParquet reader 
auto-materialization of bbox covering columns as `Box2D` (one of #2877's 
deferred follow-ups). When that lands, 
`WHERE ST_BoxIntersects(box_col, lit(b))` becomes the canonical bbox-pruned 
read path: the typed column comes from disk, and the predicate prunes the 
disk read.
   
   ## How was this patch tested?
   
   `GeoParquetSpatialFilterPushDownSuite`:
   - "Push down ST_BoxIntersects against a Box2D covering column" — Q1-only, 
left-half-only, and fully-disjoint query windows against a quadrant-partitioned 
dataset. Verifies (a) the filter is recognized and pushed down, (b) the right 
files survive evaluation.
   - "Push down ST_BoxContains against a Box2D covering column" — tiny query 
box inside Q1, verifies INTERSECTS-style file-level pruning.
   
   The fixture is a copy of the existing `df` with `withColumn("geom_bbox", 
expr("ST_Box2D(geom)"))` so the writer auto-detects it as the covering column 
for `geom`.
   
   ## What's not in scope
   
   - **Two-sided pushdown** (`ST_BoxIntersects(box_a, box_b)` between two 
columns) — that's the spatial join planner work in #2939.
   - **Mixed Box2D / Geometry predicates** — waits on the implicit cast in 
#2927 or explicit mixed overloads.
   
   ## Did this PR include necessary documentation updates?
   
   - No. On its own, this PR does not change any documented public SQL API. 
Documentation lands with the consolidated Phase 1+2+3 Box2D docs update.

