zhangfengcdt commented on PR #2831:
URL: https://github.com/apache/sedona/pull/2831#issuecomment-4254345501
> This seems like it is headed in the right direction!
>
> The `WKBGeography` is the right direction I think (In C++ I call this a
GeoArrowGeography and it's slightly more general but the same idea). As
written, I am not sure it is able to make anything faster (I am guessing that
Spark already handles `filter()`s and won't materialize a Java object unless it
will actually be used for something).
>
> This may not be true in Java, but in C++ the S2Polygon and maybe
S2Polyline has an internal index for itself and also one for each loop (lazily
constructed, but almost always needed for initializing from WKB to figure out
the nesting of shells/holes). The optimization of implementing the S2Shape
interface directly on WKB is pretty much 100% to avoid those extra index builds
(so that for any given function there's at most one shape index per object that
is built). I don't know how flexible the S2Shape is in Java (in C++
implementing it was not too bad). I think you would need to do something
similar to see improved performance (but maybe the first thing you want to do
is get the functions and tests in place).
@paleolimbot Thanks for the review and they are really helpful. I have
implemented some of the optimizations. Here's what changed and the benchmark
results.
- use little endian wkb for fast w/r
- get dim from wkb bytes directly
- add new WkbS2Shape and pre-convert wkb coordinate to S2Point (cached)
- use regions for points
- get cell union for points
- point to point distance fast path
- ShapeIndex is built from WkbS2Shape
```
ST_NPoints (Level 1 — JTS accessor)
┌───────────────────┬────────┬───────┬────────┐
│ Shape │ Before │ After │ Change │
├───────────────────┼────────┼───────┼────────┤
│ Point │ 2 │ 2 │ ~same │
├───────────────────┼────────┼───────┼────────┤
│ Polygon (16 vtx) │ 2 │ 2 │ ~same │
├───────────────────┼────────┼───────┼────────┤
│ Polygon (500 vtx) │ 2 │ 2 │ ~same │
└───────────────────┴────────┴───────┴────────┘
ST_Distance (Level 2 — S2 geodesic)
┌─────────────────────┬─────────┬─────────┬───────────────────────────────┐
│ Shape │ Before │ After │ Change │
├─────────────────────┼─────────┼─────────┼───────────────────────────────┤
│ Point │ 269 │ 245 │ 1.1x faster (point fast path) │
├─────────────────────┼─────────┼─────────┼───────────────────────────────┤
│ LineString (16 vtx) │ 1,576 │ 1,575 │ ~same │
├─────────────────────┼─────────┼─────────┼───────────────────────────────┤
│ Polygon (16 vtx) │ 1,419 │ 2,248 │ 1.6x slower │
├─────────────────────┼─────────┼─────────┼───────────────────────────────┤
│ Polygon (64 vtx) │ 69,279 │ 64,515 │ 1.07x faster │
├─────────────────────┼─────────┼─────────┼───────────────────────────────┤
│ Polygon (500 vtx) │ 224,696 │ 218,689 │ 1.03x faster │
└─────────────────────┴─────────┴─────────┴───────────────────────────────┘
```
The major improvement: WkbS2Shape avoids building 3 redundant S2ShapeIndex
instances (one per S2Loop, one for S2Polygon, and only the ShapeIndexGeography
one is actually used).
```
ST_Contains true (Level 3 — S2 predicate, point inside polygon)
┌─────────────────────┬────────┬───────┬─────────────┐
│ Shape │ Before │ After │ Change │
├─────────────────────┼────────┼───────┼─────────────┤
│ Point │ 284 │ 248 │ 1.1x faster │
├─────────────────────┼────────┼───────┼─────────────┤
│ LineString (16 vtx) │ 664 │ 238 │ 2.8x faster │
├─────────────────────┼────────┼───────┼─────────────┤
│ Polygon (16 vtx) │ 684 │ 239 │ 2.9x faster │
├─────────────────────┼────────┼───────┼─────────────┤
│ Polygon (64 vtx) │ 677 │ 233 │ 2.9x faster │
├─────────────────────┼────────┼───────┼─────────────┤
│ Polygon (500 vtx) │ 703 │ 254 │ 2.8x faster │
└─────────────────────┴────────┴───────┴─────────────┘
```
For st_contains false case, it becomes slower because
S2BooleanOperation.Contains() has different internal execution paths for true
vs false results. Not sure if sedona-db also does this (S2Polygon for polygon
predicates), which may see similar tradeoffs. But here is the results:
```
ST_Contains false (Level 3 — S2 predicate, point outside polygon)
┌─────────────────────┬────────┬───────┬─────────────┐
│ Shape │ Before │ After │ Change │
├─────────────────────┼────────┼───────┼─────────────┤
│ Point │ 226 │ 223 │ ~same │
├─────────────────────┼────────┼───────┼─────────────┤
│ LineString (16 vtx) │ 220 │ 653 │ 3x slower │
├─────────────────────┼────────┼───────┼─────────────┤
│ Polygon (16 vtx) │ 222 │ 657 │ 3x slower │
├─────────────────────┼────────┼───────┼─────────────┤
│ Polygon (64 vtx) │ 222 │ 657 │ 3x slower │
├─────────────────────┼────────┼───────┼─────────────┤
│ Polygon (500 vtx) │ 246 │ 692 │ 2.8x slower │
└─────────────────────┴────────┴───────┴─────────────┘
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]