jiayuasu opened a new pull request, #2879: URL: https://github.com/apache/sedona/pull/2879
## Summary

Issue: #2700. Milestone: 1.9.1. Follows #2876.

Adds the first workflow notebook in the docker-image refresh series. Demonstrates how Sedona's vector surface holds together end-to-end on a real dataset at realistic volume:

- Reads ~3M January 2024 yellow-taxi trips from the public [NYC TLC Cloudfront mirror](https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet), fetched at runtime to `/tmp` (Spark cannot read parquet directly through `https://` via Hadoop FileSystem).
- Ships the **NYC TLC taxi-zone polygons** (263 rows, 1.5 MB shapefile, EPSG:2263 NAD83 NY State Plane / feet) under `docs/usecases/data/nyc_taxi_zones/`. Source: NYC TLC public reference data.
- Reprojects zones to EPSG:4326 with **ST_Transform**, now backed by proj4sedona, which accepts EPSG codes, WKT2, PROJJSON, and PROJ strings.
- Aggregates trips per `(zone, hour)` and labels each zone's peak with a window function: morning / midday / evening / nightlife.
- Surfaces three new-in-1.9 APIs in their natural use case:
  - **`ST_BingTileAt`** for tile binning of zone centroids (city-wide density layer)
  - **`ST_MakeLine`** over **`ST_Centroid`** pairs to build top origin-destination flow geometries
  - **GeoParquet 1.1 writer** with auto covering-bbox metadata (round-tripped to verify)
- Renders all of it as a three-layer **SedonaKepler** map: zones colored by peak bucket, scaled centroids, top-30 OD flowlines.

The notebook is tagged `requires-network: true` in its first markdown cell so the offline test harness (`SEDONA_NOTEBOOK_OFFLINE=1`, added in #2876) skips it cleanly when CI lacks outbound network. With a warm trip download and `DRIVER_MEM=4g` it completes in well under two minutes on a laptop.

## Test plan

- [ ] `docker build -f docker/sedona-docker.dockerfile -t sedona:dev .` succeeds.
- [ ] `docker run --rm sedona:dev /opt/sedona/docker/test-notebooks.sh` exits 0: `00-quickstart` and `01-mobility-pulse` both pass; runtime for the latter is dominated by the trip-parquet download (~50 MB) and a single `groupBy` over 3M rows.
- [ ] `docker run --rm -e SEDONA_NOTEBOOK_OFFLINE=1 sedona:dev /opt/sedona/docker/test-notebooks.sh` skips `01-mobility-pulse` and reports it in the summary.
- [ ] Manual smoke: open `01-mobility-pulse.ipynb` in JupyterLab, run all cells, eyeball the three Kepler layers (zones / centroids / flowlines).
- [ ] Output GeoParquet at `/tmp/nyc_taxi_zone_pulse.parquet` round-trips through `sedona.read.format("geoparquet")` and shows the expected covering bbox in the file metadata.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
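
For reviewers unfamiliar with the tile binning that the summary attributes to `ST_BingTileAt`, here is a rough pure-Python sketch of the general Bing Maps quadkey scheme. It is not Sedona's implementation and says nothing about the `ST_BingTileAt` signature. The function names `lonlat_to_quadkey` and `tile_xy_to_quadkey` are hypothetical, and the tile computation is a simplified version of the published tile-system math (it skips the half-pixel rounding), so treat it only as an illustration of how a point maps to a tile key.

```python
import math

# Latitude limit of the Web Mercator projection used by the Bing tile system.
MAX_LAT = 85.05112878


def tile_xy_to_quadkey(tile_x: int, tile_y: int, zoom: int) -> str:
    """Interleave the bits of tile X/Y into a base-4 quadkey string."""
    digits = []
    for i in range(zoom, 0, -1):
        mask = 1 << (i - 1)
        digit = 0
        if tile_x & mask:
            digit += 1  # X bit contributes 1
        if tile_y & mask:
            digit += 2  # Y bit contributes 2
        digits.append(str(digit))
    return "".join(digits)


def lonlat_to_quadkey(lon: float, lat: float, zoom: int) -> str:
    """Quadkey of the tile containing (lon, lat) at the given zoom (hypothetical helper)."""
    lat = max(-MAX_LAT, min(MAX_LAT, lat))
    lon = max(-180.0, min(180.0, lon))
    n = 1 << zoom  # number of tiles along each axis
    # Normalised Web Mercator coordinates in [0, 1], origin at the top-left.
    x = (lon + 180.0) / 360.0
    sin_lat = math.sin(math.radians(lat))
    y = 0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)
    tile_x = min(n - 1, max(0, int(x * n)))
    tile_y = min(n - 1, max(0, int(y * n)))
    return tile_xy_to_quadkey(tile_x, tile_y, zoom)


if __name__ == "__main__":
    # Approximate Manhattan coordinates, chosen only for illustration.
    print(lonlat_to_quadkey(-74.0, 40.7, 12))
```

A useful property for a density layer is that a quadkey at a coarser zoom is a prefix of the quadkey of any tile it contains, so counts binned at a fine zoom can be rolled up by truncating the key.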
