jiayuasu opened a new pull request, #2879:
URL: https://github.com/apache/sedona/pull/2879

   ## Summary
   
   Issue: #2700. Milestone: 1.9.1. Follows #2876.
   
   Adds the first workflow notebook in the docker-image refresh series. It 
demonstrates how Sedona's vector API surface holds together end to end on a 
real dataset at realistic volume:
   
   - Reads ~3M January 2024 yellow-taxi trips from the public [NYC TLC Cloudfront mirror](https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet), 
fetched at runtime to `/tmp` (Spark cannot read parquet directly through 
`https://` via Hadoop FileSystem).
   - Ships the **NYC TLC taxi-zone polygons** (263 rows, 1.5 MB shapefile, 
EPSG:2263 NAD83 NY State Plane / feet) under 
`docs/usecases/data/nyc_taxi_zones/`. Source: NYC TLC public reference data.
   - Reprojects zones to EPSG:4326 with **ST_Transform**, now backed by 
proj4sedona — accepts EPSG codes, WKT2, PROJJSON, and PROJ strings.
   - Aggregates trips per `(zone, hour)` and labels each zone's peak with a 
window function: morning / midday / evening / nightlife.
   - Surfaces three new-in-1.9 APIs in their natural use case:
     - **`ST_BingTileAt`** for tile binning of zone centroids (city-wide 
density layer)
     - **`ST_MakeLine`** over **`ST_Centroid`** pairs to build top 
origin-destination flow geometries
     - **GeoParquet 1.1 writer** with auto covering-bbox metadata 
(round-tripped to verify)
   - Renders all of it as a three-layer **SedonaKepler** map: zones colored by 
peak bucket, scaled centroids, top-30 OD flowlines.
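
The per-zone peak labeling described in the aggregation bullet can be sketched in plain Python. This is only an illustration of the logic, not the notebook's code: the notebook does this in Spark with a window function, and the hour boundaries for each bucket below are assumptions (the PR names the four buckets but not their ranges). The names `peak_bucket` and `label_zone_peaks` are hypothetical.

```python
from collections import Counter


def peak_bucket(hour: int) -> str:
    """Map an hour of day (0-23) to one of the four bucket labels.

    The boundaries are assumed for illustration; the PR does not state them.
    """
    if not 0 <= hour <= 23:
        raise ValueError(f"hour out of range: {hour}")
    if 6 <= hour < 11:
        return "morning"
    if 11 <= hour < 16:
        return "midday"
    if 16 <= hour < 21:
        return "evening"
    return "nightlife"  # 21:00-05:59 under the assumed boundaries


def label_zone_peaks(trips):
    """Label each zone with its busiest pickup hour and bucket.

    trips: iterable of (zone_id, pickup_hour) pairs.
    Counts trips per (zone, hour), then keeps the top hour per zone,
    which mirrors a groupBy followed by a rank-1-per-zone window.
    Ties are broken in favor of the earlier hour.
    """
    counts = Counter(trips)  # (zone, hour) -> trip count
    best = {}  # zone -> (count, hour)
    for (zone, hour), n in counts.items():
        current = best.get(zone)
        if current is None or (n, -hour) > (current[0], -current[1]):
            best[zone] = (n, hour)
    return {zone: (hour, peak_bucket(hour)) for zone, (_, hour) in best.items()}
```

For example, `label_zone_peaks([(1, 8), (1, 8), (1, 18)])` labels zone 1 as a morning zone with peak hour 8 under these assumed boundaries.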
   
   The notebook is tagged `requires-network: true` in its first markdown cell 
so the offline test harness (`SEDONA_NOTEBOOK_OFFLINE=1`, added in #2876) skips 
it cleanly when CI lacks outbound network. With the trip file already 
downloaded and `DRIVER_MEM=4g`, it completes in well under two minutes on a laptop.
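
The skip rule above could be expressed roughly as follows. This is a hedged sketch, not the harness itself (the actual check lives in `test-notebooks.sh` and its implementation is not shown in this PR description). It assumes the tag appears as the literal text `requires-network: true` in the first markdown cell and that the notebook follows the standard `.ipynb` JSON layout; the function names are hypothetical.

```python
import json
import os

TAG = "requires-network: true"


def load_notebook(path: str) -> dict:
    """Read an .ipynb file, which is plain JSON."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def first_markdown_source(notebook: dict) -> str:
    """Return the text of the first markdown cell, or '' if there is none."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "markdown":
            source = cell.get("source", "")
            # nbformat allows either a single string or a list of lines.
            return "".join(source) if isinstance(source, list) else source
    return ""


def should_skip(notebook: dict, env=None) -> bool:
    """Skip only when offline mode is on AND the notebook is tagged."""
    env = os.environ if env is None else env
    offline = env.get("SEDONA_NOTEBOOK_OFFLINE") == "1"
    return offline and TAG in first_markdown_source(notebook)
```

Passing `env` explicitly keeps the check easy to exercise without mutating the real process environment.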
   
   ## Test plan
   
   - [ ] `docker build -f docker/sedona-docker.dockerfile -t sedona:dev .` 
succeeds.
   - [ ] `docker run --rm sedona:dev /opt/sedona/docker/test-notebooks.sh` 
exits 0 — `00-quickstart` and `01-mobility-pulse` both pass; runtime for the 
latter is dominated by the trip-parquet download (~50 MB) and a single 
`groupBy` over 3M rows.
   - [ ] `docker run --rm -e SEDONA_NOTEBOOK_OFFLINE=1 sedona:dev 
/opt/sedona/docker/test-notebooks.sh` skips `01-mobility-pulse` and reports it 
in the summary.
   - [ ] Manual smoke: open `01-mobility-pulse.ipynb` in JupyterLab, run all 
cells, eyeball the three Kepler layers (zones / centroids / flowlines).
   - [ ] Output GeoParquet at `/tmp/nyc_taxi_zone_pulse.parquet` round-trips 
through `sedona.read.format("geoparquet")` and shows the expected covering bbox 
in the file metadata.
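
For the last check, one possible way to confirm the covering bbox is to inspect the `geo` key in the Parquet file-level key-value metadata, which GeoParquet defines as a JSON string. The sketch below only validates that JSON; extracting the string from `/tmp/nyc_taxi_zone_pulse.parquet` with a Parquet reader is not shown, and the sample metadata in the usage note is hand-written for illustration, not output from the notebook. The function name `covering_bbox_paths` is hypothetical.

```python
import json

BBOX_KEYS = ("xmin", "ymin", "xmax", "ymax")


def covering_bbox_paths(geo_json: str, column=None):
    """Return the covering-bbox column paths declared for a geometry column.

    geo_json: the JSON string stored under the Parquet 'geo' metadata key.
    column:   geometry column name; defaults to the file's primary_column.
    Returns a dict such as {'xmin': ('bbox', 'xmin'), ...}, or None when
    the column declares no bbox covering.
    """
    meta = json.loads(geo_json)
    name = column or meta["primary_column"]
    bbox = meta["columns"][name].get("covering", {}).get("bbox")
    if bbox is None:
        return None
    missing = [key for key in BBOX_KEYS if key not in bbox]
    if missing:
        raise ValueError(f"covering.bbox is missing keys: {missing}")
    # Each entry is a path to a field of the bbox struct column.
    return {key: tuple(bbox[key]) for key in BBOX_KEYS}
```

With a metadata string whose `geometry` column declares `"covering": {"bbox": {"xmin": ["bbox", "xmin"], ...}}`, the function returns the four paths; a file written without a covering yields `None`, which would fail the check.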

