jiayuasu opened a new issue, #2949: URL: https://github.com/apache/sedona/issues/2949
Follow-up to #2938.

## Scope

The current Box2D filter pushdown (`Box2DLeafFilter`) prunes files using the *geometry column's* recorded bbox, on the assumption that per-row Box2D values equal per-row geometry envelopes. This is sound for Box2D columns produced by `ST_Box2D(geom)` (Sedona's writer, and most users' workflows). It is **not sound** when the covering Box2D column is conservatively wider than the geometry, which the GeoParquet 1.1 spec permits (e.g., `apache/sedona-db`'s Float32 writer uses `next_after` rounding).

This issue tracks the proper fix: prune using Parquet column statistics for the Box2D struct's `xmin/ymin/xmax/ymax` nested fields, which give a tight file-level bound on the Box2D values themselves regardless of how they relate to the geometry.

## Implementation outline

- Extend the GeoParquet read path to expose per-file (and ideally per-row-group) statistics for the Box2D column's nested float/double fields.
- Plumb those statistics into `Box2DLeafFilter.evaluate`, or replace `Box2DLeafFilter` with a stats-aware variant.
- The pruning logic itself doesn't change: intersect the file-level union Box2D with the query Box2D.
- Once this lands, the `spark.sedona.geoparquet.box2dFilterPushDown` opt-out conf added in #2938 can default to "always on" or be removed.

## Why deferred

Parquet column statistics for nested struct fields require working with the Parquet `FooterFiles` API and the Spark `ParquetFileFormat` internals, which is a chunkier change than the recognition logic in #2938. Better to ship the SQL-surface recognition first (which covers the common case soundly) and follow up with the universal soundness fix.
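
For illustration, here is a minimal sketch of the stats-based pruning described in the implementation outline, plus a case where pruning on the geometry bbox would wrongly skip a file. It is plain Python rather than Sedona code. `Box2D`, `FieldStats`, `RowGroupBoxStats`, `union_box`, and `can_skip_file` are hypothetical names invented for this sketch, not Sedona or Parquet APIs. It assumes each row group exposes min/max statistics for the four nested fields, and that a file with any missing statistics must be kept.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Box2D:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other: "Box2D") -> bool:
        # Closed-interval overlap test on both axes.
        return (self.xmin <= other.xmax and other.xmin <= self.xmax
                and self.ymin <= other.ymax and other.ymin <= self.ymax)


@dataclass(frozen=True)
class FieldStats:
    """Hypothetical min/max statistics for one nested field in one row group."""
    min: float
    max: float


@dataclass(frozen=True)
class RowGroupBoxStats:
    """Hypothetical per-row-group statistics for the Box2D struct's fields."""
    xmin: FieldStats
    ymin: FieldStats
    xmax: FieldStats
    ymax: FieldStats


def union_box(row_groups: Iterable[Optional[RowGroupBoxStats]]) -> Optional[Box2D]:
    """File-level union of the Box2D values, or None if it cannot be bounded."""
    groups = list(row_groups)
    if not groups or any(g is None for g in groups):
        return None  # missing statistics: no bound is known
    return Box2D(
        xmin=min(g.xmin.min for g in groups),
        ymin=min(g.ymin.min for g in groups),
        xmax=max(g.xmax.max for g in groups),
        ymax=max(g.ymax.max for g in groups),
    )


def can_skip_file(row_groups: Iterable[Optional[RowGroupBoxStats]],
                  query: Box2D) -> bool:
    """Skip only when a bound exists and it does not intersect the query."""
    bound = union_box(row_groups)
    return bound is not None and not bound.intersects(query)


# Geometry bbox recorded for the file, and Box2D values that a conservative
# writer widened by 0.5 on every side (values chosen for illustration only).
geom_bbox = Box2D(0.0, 0.0, 10.0, 10.0)
stats = [RowGroupBoxStats(
    xmin=FieldStats(-0.5, 4.0), ymin=FieldStats(-0.5, 4.0),
    xmax=FieldStats(1.0, 10.5), ymax=FieldStats(1.0, 10.5),
)]
query = Box2D(10.2, 10.2, 11.0, 11.0)

print(geom_bbox.intersects(query))   # False: geometry-bbox pruning would skip
print(can_skip_file(stats, query))   # False: stats-based pruning keeps the file
```

In this example the widened Box2D of some row reaches 10.5, so it can intersect the query even though the geometry bbox stops at 10.0. The intersection test is the same in both cases; only the source of the file-level bound differs.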
