Re: [PR] [GH-2938] Push down ST_BoxIntersects / ST_BoxContains via Parquet row-group statistics [sedona]

via GitHub Mon, 11 May 2026 22:45:14 -0700


Copilot commented on code in PR #2946:
URL: https://github.com/apache/sedona/pull/2946#discussion_r3223986873



##########
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/optimization/SpatialFilterPushDownForGeoParquet.scala:
##########
@@ -256,6 +278,16 @@ class SpatialFilterPushDownForGeoParquet(sparkSession: 
SparkSession) extends Rul
     parseColumnPath(name).mkString(".")
   }
 
+  /**
+   * Extract a [[Box2D]] from a Catalyst literal value. Box2DUDT serializes to 
an InternalRow of
+   * four doubles; if the value is something else, the predicate is not 
pushable.
+   */
+  private def extractBox2DLiteral(value: Any): Option[Box2D] = value match {
+    case row: InternalRow if row.numFields == 4 =>
+      Some(new Box2D(row.getDouble(0), row.getDouble(1), row.getDouble(2), 
row.getDouble(3)))

Review Comment:
   `extractBox2DLiteral` will push down even if the literal has inverted bounds 
(xmin>xmax or ymin>ymax). `ST_BoxIntersects`/`ST_BoxContains` are documented to 
throw on inverted boxes; if the Parquet predicate prunes all row groups, the 
query can incorrectly return empty results without ever evaluating the 
expression (hiding the expected exception). Consider validating the extracted 
Box2D has ordered bounds and returning `None` (no pushdown) when it is inverted 
so runtime semantics are preserved.
   



##########
spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/geoparquet/GeoParquetFileFormat.scala:
##########
@@ -273,6 +273,15 @@ class GeoParquetFileFormat(val spatialFilter: 
Option[GeoParquetSpatialFilter])
         None
       }
 
+      // Spatial filters that translate to Parquet row-group predicates (e.g. 
Box2D bounds
+      // comparisons on a Box2D-typed column) are AND'd into the pushed-down 
filter so Parquet
+      // can skip row groups whose column statistics disprove them.
+      val combinedPushed = spatialFilter.flatMap(_.toParquetFilter) match {
+        case Some(spatialPredicate) =>
+          Some(pushed.fold(spatialPredicate)(p => FilterApi.and(p, 
spatialPredicate)))
+        case None => pushed
+      }

Review Comment:
   `combinedPushed` applies `spatialFilter.toParquetFilter` even when 
`enableParquetFilterPushDown` is false. That bypasses the Spark SQL setting 
`spark.sql.parquet.filterPushdown` (and contradicts the nearby “Try to push 
down filters when filter push-down is enabled” comment), potentially surprising 
users who disabled Parquet predicate pushdown for correctness/debugging. 
Consider gating the spatial Parquet predicate on `enableParquetFilterPushDown` 
(or a dedicated Sedona conf) so row-group predicates are not injected when 
Parquet pushdown is disabled.
   



##########
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/optimization/SpatialFilterPushDownForGeoParquet.scala:
##########
@@ -144,6 +148,24 @@ class SpatialFilterPushDownForGeoParquet(sparkSession: 
SparkSession) extends Rul
             SpatialPredicate.INTERSECTS,
             GeometryUDT.deserialize(value))
 
+      // Box2D predicates push down as Parquet row-group filters on the Box2D 
column's underlying
+      // (xmin, ymin, xmax, ymax) double leaves. Pruning is done by Parquet's 
stats-based skipping
+      // against the column's recorded min/max, which is sound regardless of 
how the writer chose
+      // the per-row Box2D values.
+      case ST_BoxIntersects(_) =>
+        // Intersects is symmetric — both argument orders produce the same 
predicate.

Review Comment:
   The PR description mentions a `spark.sedona.geoparquet.box2dFilterPushDown` 
opt-in and describes pruning via GeoParquet geometry bbox metadata, but the 
current implementation always translates Box2D predicates into Parquet 
`FilterPredicate`s over `box_col.{xmin,ymin,xmax,ymax}` (no separate conf and 
no GeoParquet-metadata-based pruning). Please align the PR description (and/or 
add the intended config/behavior) so users aren’t misled about what is actually 
being pushed down and when.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [GH-2938] Push down ST_BoxIntersects / ST_BoxContains via Parquet row-group statistics [sedona]

Reply via email to