Copilot commented on code in PR #2946:
URL: https://github.com/apache/sedona/pull/2946#discussion_r3223986873
##########
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/optimization/SpatialFilterPushDownForGeoParquet.scala:
##########
@@ -256,6 +278,16 @@ class SpatialFilterPushDownForGeoParquet(sparkSession:
SparkSession) extends Rul
parseColumnPath(name).mkString(".")
}
+ /**
+ * Extract a [[Box2D]] from a Catalyst literal value. Box2DUDT serializes to
an InternalRow of
+ * four doubles; if the value is something else, the predicate is not
pushable.
+ */
+ private def extractBox2DLiteral(value: Any): Option[Box2D] = value match {
+ case row: InternalRow if row.numFields == 4 =>
+ Some(new Box2D(row.getDouble(0), row.getDouble(1), row.getDouble(2),
row.getDouble(3)))
Review Comment:
`extractBox2DLiteral` will push down even if the literal has inverted bounds
(xmin>xmax or ymin>ymax). `ST_BoxIntersects`/`ST_BoxContains` are documented to
throw on inverted boxes; if the Parquet predicate prunes all row groups, the
query can incorrectly return empty results without ever evaluating the
expression (hiding the expected exception). Consider validating the extracted
Box2D has ordered bounds and returning `None` (no pushdown) when it is inverted
so runtime semantics are preserved.
##########
spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/geoparquet/GeoParquetFileFormat.scala:
##########
@@ -273,6 +273,15 @@ class GeoParquetFileFormat(val spatialFilter:
Option[GeoParquetSpatialFilter])
None
}
+ // Spatial filters that translate to Parquet row-group predicates (e.g.
Box2D bounds
+ // comparisons on a Box2D-typed column) are AND'd into the pushed-down
filter so Parquet
+ // can skip row groups whose column statistics disprove them.
+ val combinedPushed = spatialFilter.flatMap(_.toParquetFilter) match {
+ case Some(spatialPredicate) =>
+ Some(pushed.fold(spatialPredicate)(p => FilterApi.and(p,
spatialPredicate)))
+ case None => pushed
+ }
Review Comment:
`combinedPushed` applies `spatialFilter.toParquetFilter` even when
`enableParquetFilterPushDown` is false. That bypasses the Spark SQL setting
`spark.sql.parquet.filterPushdown` (and contradicts the nearby “Try to push
down filters when filter push-down is enabled” comment), potentially surprising
users who disabled Parquet predicate pushdown for correctness/debugging.
Consider gating the spatial Parquet predicate on `enableParquetFilterPushDown`
(or a dedicated Sedona conf) so row-group predicates are not injected when
Parquet pushdown is disabled.
##########
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/optimization/SpatialFilterPushDownForGeoParquet.scala:
##########
@@ -144,6 +148,24 @@ class SpatialFilterPushDownForGeoParquet(sparkSession:
SparkSession) extends Rul
SpatialPredicate.INTERSECTS,
GeometryUDT.deserialize(value))
+ // Box2D predicates push down as Parquet row-group filters on the Box2D
column's underlying
+ // (xmin, ymin, xmax, ymax) double leaves. Pruning is done by Parquet's
stats-based skipping
+ // against the column's recorded min/max, which is sound regardless of
how the writer chose
+ // the per-row Box2D values.
+ case ST_BoxIntersects(_) =>
+ // Intersects is symmetric — both argument orders produce the same
predicate.
Review Comment:
The PR description mentions a `spark.sedona.geoparquet.box2dFilterPushDown`
opt-in and describes pruning via GeoParquet geometry bbox metadata, but the
current implementation always translates Box2D predicates into Parquet
`FilterPredicate`s over `box_col.{xmin,ymin,xmax,ymax}` (no separate conf and
no GeoParquet-metadata-based pruning). Please align the PR description (and/or
add the intended config/behavior) so users aren’t misled about what is actually
being pushed down and when.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]