zhangfengcdt opened a new pull request, #2831: URL: https://github.com/apache/sedona/pull/2831
## Did you read the Contributor Guide?

- Yes, I have read the [Contributor Rules](https://sedona.apache.org/latest/community/rule/) and [Contributor Development Guide](https://sedona.apache.org/latest/community/develop/)

## Is this PR related to a ticket?

- Yes, and the PR name follows the format `[GH-XXX] my subject`. Closes #<issue_number>

## What changes were proposed in this PR?

Implements WKB-based Geography serialization (Option B: WKB with Cached S2) and a full set of Geography ST functions.

**Core architecture:**

- WKBGeography: stores WKB bytes as the primary representation, with lazily parsed JTS, S2, and ShapeIndex caches (double-checked locking for thread safety)
- GeographyWKBSerializer: WKB serializer with a 0xFF format byte, backward-compatible with the legacy S2-native format
- GeographyUDT, implicits.scala, GeometrySerde: switched to WKBSerializer for all serialization paths

**Geography functions (13 new):**

- Level 1 (JTS): ST_AsText, ST_NPoints, ST_GeometryType, ST_NumGeometries, ST_Centroid
- Level 2 (JTS + Spheroid): ST_Distance, ST_Area, ST_Length
- Level 3 (S2): ST_MaxDistance, ST_ClosestPoint, ST_Contains, ST_Intersects, ST_Equals

**Performance:**

- ST_Distance uses S2ClosestEdgeQuery for true geometry-to-geometry distance (consistent with sedona-db)
- ShapeIndex cached in WKBGeography: 2-6x faster for repeated S2 operations
- Configurable spark.sedona.geography.eagerShapeIndex for predicate-heavy workloads
- JMH benchmark module with 4 benchmark classes (single-call, serializer comparison, GeoParquet scenario, batch processing)

**Docs**: API docs for all 13 new functions in docs/api/sql/geography/

**Note**: Geography-aware spatial join partitioning using S2 cells will be in a separate PR

## How was this patch tested?

- 1032 unit tests pass in the common module (28 new in WKBGeographyTest, 24 in FunctionTest)
- GeographyFunctionTest.scala: 34 Spark SQL integration tests covering constructors, structural functions, metrics, predicates, DataFrame API, and serialization round-trips
- JMH benchmarks verified across point, linestring, and polygon (16/64/500 vertices) inputs, with the GeoParquet scenario showing zero performance penalty vs. the S2-parse-from-WKB path

## Did this PR include necessary documentation updates?

- Yes, I have updated the documentation.
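
For reviewers unfamiliar with the pattern named under **Core architecture**, here is a minimal sketch of a lazily parsed cache guarded by double-checked locking. It is not the code in this PR: `LazyWkbCache` and its `parser` argument are hypothetical names used only for illustration, and the real WKBGeography holds several caches (JTS, S2, ShapeIndex) rather than one generic slot.

```java
import java.util.function.Function;

/**
 * Hypothetical illustration only; not the WKBGeography class from this PR.
 * Keeps the raw WKB bytes as the primary representation and parses them
 * into a derived object at most once, on first access.
 */
final class LazyWkbCache<T> {
    private final byte[] wkb;                  // primary representation
    private final Function<byte[], T> parser;  // assumed parser, supplied by the caller
    private volatile T parsed;                 // volatile is required for safe double-checked locking

    LazyWkbCache(byte[] wkb, Function<byte[], T> parser) {
        this.wkb = wkb.clone();
        this.parser = parser;
    }

    T get() {
        T local = parsed;                      // first check, no lock taken
        if (local == null) {
            synchronized (this) {
                local = parsed;                // second check, under the lock
                if (local == null) {
                    local = parser.apply(wkb); // parse once
                    parsed = local;            // publish to other threads
                }
            }
        }
        return local;
    }

    byte[] wkb() {
        return wkb.clone();                    // defensive copy of the stored bytes
    }
}
```

Once the value is cached, later calls take only the unlocked fast path, which is what makes repeated operations on the same object cheap. A parser that returns null would be re-invoked on every call in this sketch.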
