Re: [PR] [GH-2700] Add 05-geopandas-on-spark notebook [sedona]

via GitHub Fri, 01 May 2026 22:53:36 -0700


Copilot commented on code in PR #2889:
URL: https://github.com/apache/sedona/pull/2889#discussion_r3176218563



##########
docs/usecases/05-geopandas-on-spark.ipynb:
##########
@@ -0,0 +1,280 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<!--\n",
+    " Licensed to the Apache Software Foundation (ASF) under one\n",
+    " or more contributor license agreements.  See the NOTICE file\n",
+    " distributed with this work for additional information\n",
+    " regarding copyright ownership.  The ASF licenses this file\n",
+    " to you under the Apache License, Version 2.0 (the\n",
+    " \"License\"); you may not use this file except in compliance\n",
+    " with the License.  You may obtain a copy of the License at\n",
+    "\n",
+    "   http://www.apache.org/licenses/LICENSE-2.0\n",
+    "\n",
+    " Unless required by applicable law or agreed to in writing,\n",
+    " software distributed under the License is distributed on an\n",
+    " \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    " KIND, either express or implied.  See the License for the\n",
+    " specific language governing permissions and limitations\n",
+    " under the License.\n",
+    "-->\n",
+    "\n",
+    "# Your GeoPandas notebook, scaled with Sedona\n",
+    "\n",
+    "Sedona ships a `sedona.spark.geopandas` package that mirrors the public 
GeoPandas API — same constructors, same method names, same return shapes — but 
runs on a Spark backend so the same code path scales from a laptop to a 
cluster. We answer:\n",
+    "\n",
+    "> **What does it look like to take a typical GeoPandas script and run it 
on Sedona?**\n",
+    "\n",
+    "Along the way we use methods that landed in 1.8 / 1.9 — `convex_hull`, 
`concave_hull`, `voronoi_polygons`, `clip_by_rect`, `to_crs`, `total_bounds`, 
`to_geopandas` — and show how to drop into SQL when GeoPandas's API doesn't 
have what you need. Data is the Natural Earth countries shapefile already 
shipped with the docker image; no network required."

Review Comment:
   The intro lists APIs like `concave_hull`, `to_crs`, and `voronoi_polygons` 
as being used, but the notebook doesn't actually call them (it uses 
`convex_hull`, `clip_by_rect`, `total_bounds`, `to_geopandas`, and SQL 
`ST_VoronoiPolygons`). Either add a minimal example using those APIs or remove 
them from this list to avoid misleading readers.
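
A mismatch like this can be caught mechanically. The following is a minimal sketch, not part of the PR: it assumes the notebook has been loaded as a plain dict in the standard `.ipynb` JSON layout, and `unused_api_names` is a hypothetical helper name. It reports which advertised method names never appear in any code cell.

```python
import re


def unused_api_names(notebook: dict, advertised: list[str]) -> list[str]:
    """Return the advertised names that never appear in any code cell."""
    # Concatenate the source of every code cell; markdown cells are ignored
    # so that a name mentioned only in prose does not count as "used".
    code = "\n".join(
        "".join(cell["source"])
        for cell in notebook["cells"]
        if cell["cell_type"] == "code"
    )
    # Case-sensitive whole-word match, so `ST_VoronoiPolygons` in SQL does
    # not count as a call to `voronoi_polygons`.
    return [
        name
        for name in advertised
        if not re.search(rf"\b{re.escape(name)}\b", code)
    ]


# Tiny stand-in notebook (illustrative only, not the real file).
demo_notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["We use `to_crs` and friends."]},
        {
            "cell_type": "code",
            "source": [
                "hull = geom.convex_hull\n",
                "clipped = cells.clip_by_rect(0, 0, 1, 1)\n",
            ],
        },
    ]
}

print(
    unused_api_names(
        demo_notebook,
        ["convex_hull", "concave_hull", "to_crs", "clip_by_rect"],
    )
)
```

For the real file, the dict could come from `json.load` on `docs/usecases/05-geopandas-on-spark.ipynb`, with the list taken from the intro paragraph.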
   



##########
docs/usecases/05-geopandas-on-spark.ipynb:
##########
@@ -0,0 +1,280 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<!--\n",
+    " Licensed to the Apache Software Foundation (ASF) under one\n",
+    " or more contributor license agreements.  See the NOTICE file\n",
+    " distributed with this work for additional information\n",
+    " regarding copyright ownership.  The ASF licenses this file\n",
+    " to you under the Apache License, Version 2.0 (the\n",
+    " \"License\"); you may not use this file except in compliance\n",
+    " with the License.  You may obtain a copy of the License at\n",
+    "\n",
+    "   http://www.apache.org/licenses/LICENSE-2.0\n",
+    "\n",
+    " Unless required by applicable law or agreed to in writing,\n",
+    " software distributed under the License is distributed on an\n",
+    " \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    " KIND, either express or implied.  See the License for the\n",
+    " specific language governing permissions and limitations\n",
+    " under the License.\n",
+    "-->\n",
+    "\n",
+    "# Your GeoPandas notebook, scaled with Sedona\n",
+    "\n",
+    "Sedona ships a `sedona.spark.geopandas` package that mirrors the public 
GeoPandas API — same constructors, same method names, same return shapes — but 
runs on a Spark backend so the same code path scales from a laptop to a 
cluster. We answer:\n",
+    "\n",
+    "> **What does it look like to take a typical GeoPandas script and run it 
on Sedona?**\n",
+    "\n",
+    "Along the way we use methods that landed in 1.8 / 1.9 — `convex_hull`, 
`concave_hull`, `voronoi_polygons`, `clip_by_rect`, `to_crs`, `total_bounds`, 
`to_geopandas` — and show how to drop into SQL when GeoPandas's API doesn't 
have what you need. Data is the Natural Earth countries shapefile already 
shipped with the docker image; no network required."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Connect to Spark through SedonaContext\n",
+    "\n",
+    "One difference from the other example notebooks: `sedona.spark.geopandas` 
runs on top of **pandas-on-Spark** (`pyspark.pandas`), which currently requires 
Spark's ANSI mode to be off. We set that flag explicitly when building the 
session."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sedona.spark import SedonaContext\n",
+    "\n",
+    "config = (\n",
+    "    SedonaContext.builder()\n",
+    "    .master(\"spark://localhost:7077\")\n",
+    "    .config(\"spark.sql.ansi.enabled\", \"false\")\n",
+    "    .getOrCreate()\n",
+    ")\n",
+    "sedona = SedonaContext.create(config)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Load a shapefile with the same `read_file` you already know\n",
+    "\n",
+    "`sedona.spark.geopandas.read_file` is a drop-in for 
`geopandas.read_file`. The only twist when pointing at a directory is to 
declare the format explicitly — the file extension can't be inferred for 
shapefile bundles."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sedona.spark import geopandas as sgpd\n",
+    "\n",
+    "countries = sgpd.read_file(\"data/ne_50m_admin_0_countries_lakes\", 
format=\"shapefile\")\n",
+    "print(f\"loaded {len(countries)} countries\")\n",
+    "print(\"columns:\", countries.columns.tolist()[:6], \"…\")\n",
+    "countries[[\"NAME\", \"CONTINENT\", \"POP_EST\"]].head(5)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Filter, then derive — the same idioms as vanilla GeoPandas\n",
+    "\n",
+    "Boolean indexing, `.geometry` accessor, `centroid`, `convex_hull`, 
`area`, and `total_bounds` all work exactly as they do in `geopandas`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "africa = countries[countries.CONTINENT == \"Africa\"]\n",
+    "print(f\"{len(africa)} African countries\")\n",
+    "\n",
+    "geom = africa.geometry\n",
+    "print(\"\\nbounding box of the continent:\", tuple(round(b, 2) for b in 
geom.total_bounds))\n",
+    "\n",
+    "summary = africa[[\"NAME\"]].copy()\n",
+    "summary[\"centroid\"] = geom.centroid\n",
+    "summary[\"area_deg2\"] = geom.area\n",
+    "summary[\"hull_area_deg2\"] = geom.convex_hull.area\n",
+    "summary.sort_values(\"area_deg2\", ascending=False).head(5)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Voronoi catchments via `ST_VoronoiPolygons` + `ST_Collect_Agg`\n",
+    "\n",
+    "`GeoSeries.voronoi_polygons()` runs a Voronoi tessellation **per row**, 
which only makes sense if a single row already contains a MultiPoint. To 
compute one Voronoi diagram from many points, drop into SQL: aggregate every 
centroid into a single MultiPoint with the `ST_Collect_Agg` aggregator (new in 
1.8.1), then call `ST_VoronoiPolygons` on the aggregate."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "africa.spark.frame().createOrReplaceTempView(\"africa\")\n",
+    "\n",
+    "voronoi_geom = sedona.sql(\"\"\"\n",
+    "    SELECT ST_VoronoiPolygons(ST_Collect_Agg(ST_Centroid(geometry))) AS 
v\n",
+    "    FROM africa\n",
+    "\"\"\").first()[0]\n",
+    "\n",
+    "stats = sedona.sql(\"\"\"\n",
+    "    SELECT ST_NumGeometries(v)        AS cells,\n",
+    "           ROUND(ST_Area(v), 2)       AS total_area_deg2,\n",
+    "           ROUND(ST_XMin(v), 2)       AS xmin,\n",
+    "           ROUND(ST_YMin(v), 2)       AS ymin,\n",
+    "           ROUND(ST_XMax(v), 2)       AS xmax,\n",
+    "           ROUND(ST_YMax(v), 2)       AS ymax\n",
+    "    FROM (SELECT 
ST_VoronoiPolygons(ST_Collect_Agg(ST_Centroid(geometry))) AS v\n",
+    "          FROM africa)\n",
+    "\"\"\")\n",
+    "stats.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Clip the Voronoi diagram to a continental bounding rectangle\n",
+    "\n",
+    "`clip_by_rect(xmin, ymin, xmax, ymax)` (new in 1.9) is the 
geopandas-style way to crop. We use it to confine the Voronoi cells to a 
generous Africa bbox so they line up cleanly with the country polygons in the 
final plot."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from shapely.wkt import loads as wkt_loads\n",
+    "from shapely.geometry import shape\n",
+    "\n",
+    "voronoi_shapely = (\n",
+    "    wkt_loads(voronoi_geom.wkt) if hasattr(voronoi_geom, \"wkt\") else 
voronoi_geom\n",
+    ")\n",
+    "voronoi_cells = sgpd.GeoSeries([g for g in voronoi_shapely.geoms])\n",
+    "print(f\"{len(voronoi_cells)} Voronoi cells before clip\")\n",
+    "\n",
+    "africa_bbox = (-20.0, -36.0, 52.0, 38.0)\n",
+    "clipped = voronoi_cells.clip_by_rect(*africa_bbox)\n",
+    "print(f\"{len(clipped)} Voronoi cells after clip_by_rect\")\n",

Review Comment:
   `clip_by_rect` is called here and described as “new in 1.9”. Since these 
notebooks ship in the docker image, please make the required Sedona version 
explicit (e.g., at the top of the notebook) or avoid 1.9-only APIs unless the 
docker image is guaranteed to install Sedona >= 1.9; otherwise this example 
will fail when run against older pinned versions.
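
One way to make the requirement explicit is a guard in the first code cell. This is a rough sketch under stated assumptions: the minimum of 1.9 is taken from the notebook's own "new in 1.9" wording, `apache-sedona` is assumed to be the installed distribution name, and `parse_release` / `meets_minimum` are hypothetical helper names.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumption: clip_by_rect needs Sedona >= 1.9, as the notebook text states.
MIN_SEDONA = (1, 9)


def parse_release(text: str) -> tuple[int, ...]:
    """Turn '1.9.0' or '1.9.0.dev1' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev1'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: tuple[int, ...] = MIN_SEDONA) -> bool:
    """True when the installed release is at least the required minimum."""
    return parse_release(installed) >= minimum


try:
    # Assumption: the distribution is published as 'apache-sedona'.
    installed = version("apache-sedona")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("apache-sedona is not installed in this environment")
elif meets_minimum(installed):
    print(f"apache-sedona {installed} satisfies the minimum")
else:
    print(f"apache-sedona {installed} is older than the required minimum")
```

In the notebook itself the last branch would more likely raise an error than print, so the run stops before the first 1.9-only call.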



##########
docs/usecases/05-geopandas-on-spark.ipynb:
##########
@@ -0,0 +1,280 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<!--\n",
+    " Licensed to the Apache Software Foundation (ASF) under one\n",
+    " or more contributor license agreements.  See the NOTICE file\n",
+    " distributed with this work for additional information\n",
+    " regarding copyright ownership.  The ASF licenses this file\n",
+    " to you under the Apache License, Version 2.0 (the\n",
+    " \"License\"); you may not use this file except in compliance\n",
+    " with the License.  You may obtain a copy of the License at\n",
+    "\n",
+    "   http://www.apache.org/licenses/LICENSE-2.0\n",
+    "\n",
+    " Unless required by applicable law or agreed to in writing,\n",
+    " software distributed under the License is distributed on an\n",
+    " \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    " KIND, either express or implied.  See the License for the\n",
+    " specific language governing permissions and limitations\n",
+    " under the License.\n",
+    "-->\n",
+    "\n",
+    "# Your GeoPandas notebook, scaled with Sedona\n",
+    "\n",
+    "Sedona ships a `sedona.spark.geopandas` package that mirrors the public 
GeoPandas API — same constructors, same method names, same return shapes — but 
runs on a Spark backend so the same code path scales from a laptop to a 
cluster. We answer:\n",
+    "\n",
+    "> **What does it look like to take a typical GeoPandas script and run it 
on Sedona?**\n",
+    "\n",
+    "Along the way we use methods that landed in 1.8 / 1.9 — `convex_hull`, 
`concave_hull`, `voronoi_polygons`, `clip_by_rect`, `to_crs`, `total_bounds`, 
`to_geopandas` — and show how to drop into SQL when GeoPandas's API doesn't 
have what you need. Data is the Natural Earth countries shapefile already 
shipped with the docker image; no network required."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Connect to Spark through SedonaContext\n",
+    "\n",
+    "One difference from the other example notebooks: `sedona.spark.geopandas` 
runs on top of **pandas-on-Spark** (`pyspark.pandas`), which currently requires 
Spark's ANSI mode to be off. We set that flag explicitly when building the 
session."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sedona.spark import SedonaContext\n",
+    "\n",
+    "config = (\n",
+    "    SedonaContext.builder()\n",
+    "    .master(\"spark://localhost:7077\")\n",
+    "    .config(\"spark.sql.ansi.enabled\", \"false\")\n",
+    "    .getOrCreate()\n",
+    ")\n",
+    "sedona = SedonaContext.create(config)"

Review Comment:
   The last line in this cell has leading indentation (`sedona = ...`), which 
will raise an `IndentationError` when executed/nbconverted. Align it with 
`config = ...` (no unexpected indent).
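
Errors of this kind can be caught before merge by compiling each code cell. The sketch below assumes the notebook is available as a dict in the standard `.ipynb` JSON layout; `cells_with_syntax_errors` is a hypothetical helper name. Cells that use IPython magics such as `%matplotlib` would also be flagged, since they are not plain Python.

```python
def cells_with_syntax_errors(notebook: dict) -> list[tuple[int, str]]:
    """Return (cell index, error summary) for code cells that fail to compile."""
    problems = []
    for index, cell in enumerate(notebook["cells"]):
        if cell["cell_type"] != "code":
            continue
        source = "".join(cell["source"])
        try:
            # compile() only parses the cell; nothing is executed.
            compile(source, f"<cell {index}>", "exec")
        except SyntaxError as err:  # IndentationError is a subclass
            problems.append((index, f"{type(err).__name__}: {err.msg}"))
    return problems


# Stand-in notebook with one deliberately mis-indented cell (illustrative only).
demo_notebook = {
    "cells": [
        {"cell_type": "code", "source": ["value = 1\n", "print(value)"]},
        {
            "cell_type": "code",
            "source": ["config = (\n", "    1\n", ")\n", "    sedona = config"],
        },
    ]
}

for index, message in cells_with_syntax_errors(demo_notebook):
    print(f"cell {index}: {message}")
```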



##########
docs/usecases/05-geopandas-on-spark.ipynb:
##########
@@ -0,0 +1,280 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<!--\n",
+    " Licensed to the Apache Software Foundation (ASF) under one\n",
+    " or more contributor license agreements.  See the NOTICE file\n",
+    " distributed with this work for additional information\n",
+    " regarding copyright ownership.  The ASF licenses this file\n",
+    " to you under the Apache License, Version 2.0 (the\n",
+    " \"License\"); you may not use this file except in compliance\n",
+    " with the License.  You may obtain a copy of the License at\n",
+    "\n",
+    "   http://www.apache.org/licenses/LICENSE-2.0\n",
+    "\n",
+    " Unless required by applicable law or agreed to in writing,\n",
+    " software distributed under the License is distributed on an\n",
+    " \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    " KIND, either express or implied.  See the License for the\n",
+    " specific language governing permissions and limitations\n",
+    " under the License.\n",
+    "-->\n",
+    "\n",
+    "# Your GeoPandas notebook, scaled with Sedona\n",
+    "\n",
+    "Sedona ships a `sedona.spark.geopandas` package that mirrors the public 
GeoPandas API — same constructors, same method names, same return shapes — but 
runs on a Spark backend so the same code path scales from a laptop to a 
cluster. We answer:\n",
+    "\n",
+    "> **What does it look like to take a typical GeoPandas script and run it 
on Sedona?**\n",
+    "\n",
+    "Along the way we use methods that landed in 1.8 / 1.9 — `convex_hull`, 
`concave_hull`, `voronoi_polygons`, `clip_by_rect`, `to_crs`, `total_bounds`, 
`to_geopandas` — and show how to drop into SQL when GeoPandas's API doesn't 
have what you need. Data is the Natural Earth countries shapefile already 
shipped with the docker image; no network required."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Connect to Spark through SedonaContext\n",
+    "\n",
+    "One difference from the other example notebooks: `sedona.spark.geopandas` 
runs on top of **pandas-on-Spark** (`pyspark.pandas`), which currently requires 
Spark's ANSI mode to be off. We set that flag explicitly when building the 
session."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sedona.spark import SedonaContext\n",
+    "\n",
+    "config = (\n",
+    "    SedonaContext.builder()\n",
+    "    .master(\"spark://localhost:7077\")\n",
+    "    .config(\"spark.sql.ansi.enabled\", \"false\")\n",
+    "    .getOrCreate()\n",
+    ")\n",
+    "sedona = SedonaContext.create(config)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Load a shapefile with the same `read_file` you already know\n",
+    "\n",
+    "`sedona.spark.geopandas.read_file` is a drop-in for 
`geopandas.read_file`. The only twist when pointing at a directory is to 
declare the format explicitly — the file extension can't be inferred for 
shapefile bundles."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sedona.spark import geopandas as sgpd\n",
+    "\n",
+    "countries = sgpd.read_file(\"data/ne_50m_admin_0_countries_lakes\", 
format=\"shapefile\")\n",
+    "print(f\"loaded {len(countries)} countries\")\n",
+    "print(\"columns:\", countries.columns.tolist()[:6], \"…\")\n",
+    "countries[[\"NAME\", \"CONTINENT\", \"POP_EST\"]].head(5)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Filter, then derive — the same idioms as vanilla GeoPandas\n",
+    "\n",
+    "Boolean indexing, `.geometry` accessor, `centroid`, `convex_hull`, 
`area`, and `total_bounds` all work exactly as they do in `geopandas`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "africa = countries[countries.CONTINENT == \"Africa\"]\n",
+    "print(f\"{len(africa)} African countries\")\n",
+    "\n",
+    "geom = africa.geometry\n",
+    "print(\"\\nbounding box of the continent:\", tuple(round(b, 2) for b in 
geom.total_bounds))\n",
+    "\n",
+    "summary = africa[[\"NAME\"]].copy()\n",
+    "summary[\"centroid\"] = geom.centroid\n",
+    "summary[\"area_deg2\"] = geom.area\n",
+    "summary[\"hull_area_deg2\"] = geom.convex_hull.area\n",
+    "summary.sort_values(\"area_deg2\", ascending=False).head(5)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Voronoi catchments via `ST_VoronoiPolygons` + `ST_Collect_Agg`\n",
+    "\n",
+    "`GeoSeries.voronoi_polygons()` runs a Voronoi tessellation **per row**, 
which only makes sense if a single row already contains a MultiPoint. To 
compute one Voronoi diagram from many points, drop into SQL: aggregate every 
centroid into a single MultiPoint with the `ST_Collect_Agg` aggregator (new in 
1.8.1), then call `ST_VoronoiPolygons` on the aggregate."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "africa.spark.frame().createOrReplaceTempView(\"africa\")\n",
+    "\n",
+    "voronoi_geom = sedona.sql(\"\"\"\n",
+    "    SELECT ST_VoronoiPolygons(ST_Collect_Agg(ST_Centroid(geometry))) AS 
v\n",
+    "    FROM africa\n",
+    "\"\"\").first()[0]\n",
+    "\n",
+    "stats = sedona.sql(\"\"\"\n",
+    "    SELECT ST_NumGeometries(v)        AS cells,\n",
+    "           ROUND(ST_Area(v), 2)       AS total_area_deg2,\n",
+    "           ROUND(ST_XMin(v), 2)       AS xmin,\n",
+    "           ROUND(ST_YMin(v), 2)       AS ymin,\n",
+    "           ROUND(ST_XMax(v), 2)       AS xmax,\n",
+    "           ROUND(ST_YMax(v), 2)       AS ymax\n",
+    "    FROM (SELECT 
ST_VoronoiPolygons(ST_Collect_Agg(ST_Centroid(geometry))) AS v\n",
+    "          FROM africa)\n",
+    "\"\"\")\n",
+    "stats.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Clip the Voronoi diagram to a continental bounding rectangle\n",
+    "\n",
+    "`clip_by_rect(xmin, ymin, xmax, ymax)` (new in 1.9) is the 
geopandas-style way to crop. We use it to confine the Voronoi cells to a 
generous Africa bbox so they line up cleanly with the country polygons in the 
final plot."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from shapely.wkt import loads as wkt_loads\n",
+    "from shapely.geometry import shape\n",
+    "\n",
+    "voronoi_shapely = (\n",
+    "    wkt_loads(voronoi_geom.wkt) if hasattr(voronoi_geom, \"wkt\") else 
voronoi_geom\n",
+    ")\n",
+    "voronoi_cells = sgpd.GeoSeries([g for g in voronoi_shapely.geoms])\n",
+    "print(f\"{len(voronoi_cells)} Voronoi cells before clip\")\n",
+    "\n",
+    "africa_bbox = (-20.0, -36.0, 52.0, 38.0)\n",
+    "clipped = voronoi_cells.clip_by_rect(*africa_bbox)\n",
+    "print(f\"{len(clipped)} Voronoi cells after clip_by_rect\")\n",
+    "clipped.head(2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Hand off to vanilla GeoPandas for plotting\n",
+    "\n",
+    "When the data is small enough, `to_geopandas()` materializes a Sedona 
GeoDataFrame as a vanilla `geopandas.GeoDataFrame` so it can be plotted with 
the standard `.plot(ax=...)` machinery."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import matplotlib.pyplot as plt\n",
+    "\n",
+    "africa_gp = africa.to_geopandas()\n",
+    "voronoi_gp = clipped.to_geopandas()\n",
+    "\n",
+    "fig, ax = plt.subplots(1, 1, figsize=(8, 8))\n",
+    "africa_gp.plot(ax=ax, color=\"#fdf6e3\", edgecolor=\"#586e75\", 
linewidth=0.6)\n",
+    "voronoi_gp.boundary.plot(ax=ax, color=\"#dc322f\", linewidth=0.4, 
alpha=0.8)\n",
+    "africa_gp.geometry.centroid.plot(ax=ax, color=\"#dc322f\", 
markersize=4)\n",
+    "ax.set_title(\"Africa: Voronoi catchments around country centroids\")\n",
+    "ax.set_xlabel(\"longitude\")\n",
+    "ax.set_ylabel(\"latitude\")\n",
+    "ax.set_xlim(africa_bbox[0], africa_bbox[2])\n",
+    "ax.set_ylim(africa_bbox[1], africa_bbox[3])\n",
+    "fig.tight_layout()\n",
+    "fig"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 7. Drop into SQL whenever you need a function the API doesn't 
expose\n",
+    "\n",
+    "`<gdf>.spark.frame()` returns the underlying Spark DataFrame, so the 
entire `ST_*` SQL catalog is one `createOrReplaceTempView` away. Here we ask 
which African capitals are closest to (0°N, 0°E) using `ST_DistanceSpheroid` 
(great-circle distance in metres), without leaving the data we already loaded 
with the geopandas API."

Review Comment:
   This section (and the query below) says “African capitals”, but the notebook 
only loads country polygons and then uses `ST_Centroid(geometry)`; the result 
is the closest *country centroid*, not a capital. Please reword to “countries” 
(or, if you intended capitals, load a capitals dataset / use capital 
coordinates if present in attributes).
   



##########
docs/usecases/05-geopandas-on-spark.ipynb:
##########
@@ -0,0 +1,280 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<!--\n",
+    " Licensed to the Apache Software Foundation (ASF) under one\n",
+    " or more contributor license agreements.  See the NOTICE file\n",
+    " distributed with this work for additional information\n",
+    " regarding copyright ownership.  The ASF licenses this file\n",
+    " to you under the Apache License, Version 2.0 (the\n",
+    " \"License\"); you may not use this file except in compliance\n",
+    " with the License.  You may obtain a copy of the License at\n",
+    "\n",
+    "   http://www.apache.org/licenses/LICENSE-2.0\n",
+    "\n",
+    " Unless required by applicable law or agreed to in writing,\n",
+    " software distributed under the License is distributed on an\n",
+    " \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    " KIND, either express or implied.  See the License for the\n",
+    " specific language governing permissions and limitations\n",
+    " under the License.\n",
+    "-->\n",
+    "\n",
+    "# Your GeoPandas notebook, scaled with Sedona\n",
+    "\n",
+    "Sedona ships a `sedona.spark.geopandas` package that mirrors the public 
GeoPandas API — same constructors, same method names, same return shapes — but 
runs on a Spark backend so the same code path scales from a laptop to a 
cluster. We answer:\n",
+    "\n",
+    "> **What does it look like to take a typical GeoPandas script and run it 
on Sedona?**\n",
+    "\n",
+    "Along the way we use methods that landed in 1.8 / 1.9 — `convex_hull`, 
`concave_hull`, `voronoi_polygons`, `clip_by_rect`, `to_crs`, `total_bounds`, 
`to_geopandas` — and show how to drop into SQL when GeoPandas's API doesn't 
have what you need. Data is the Natural Earth countries shapefile already 
shipped with the docker image; no network required."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Connect to Spark through SedonaContext\n",
+    "\n",
+    "One difference from the other example notebooks: `sedona.spark.geopandas` 
runs on top of **pandas-on-Spark** (`pyspark.pandas`), which currently requires 
Spark's ANSI mode to be off. We set that flag explicitly when building the 
session."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sedona.spark import SedonaContext\n",
+    "\n",
+    "config = (\n",
+    "    SedonaContext.builder()\n",
+    "    .master(\"spark://localhost:7077\")\n",
+    "    .config(\"spark.sql.ansi.enabled\", \"false\")\n",
+    "    .getOrCreate()\n",
+    ")\n",
+    "sedona = SedonaContext.create(config)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Load a shapefile with the same `read_file` you already know\n",
+    "\n",
+    "`sedona.spark.geopandas.read_file` is a drop-in for 
`geopandas.read_file`. The only twist when pointing at a directory is to 
declare the format explicitly — the file extension can't be inferred for 
shapefile bundles."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sedona.spark import geopandas as sgpd\n",
+    "\n",
+    "countries = sgpd.read_file(\"data/ne_50m_admin_0_countries_lakes\", 
format=\"shapefile\")\n",
+    "print(f\"loaded {len(countries)} countries\")\n",
+    "print(\"columns:\", countries.columns.tolist()[:6], \"…\")\n",
+    "countries[[\"NAME\", \"CONTINENT\", \"POP_EST\"]].head(5)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Filter, then derive — the same idioms as vanilla GeoPandas\n",
+    "\n",
+    "Boolean indexing, `.geometry` accessor, `centroid`, `convex_hull`, 
`area`, and `total_bounds` all work exactly as they do in `geopandas`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "africa = countries[countries.CONTINENT == \"Africa\"]\n",
+    "print(f\"{len(africa)} African countries\")\n",
+    "\n",
+    "geom = africa.geometry\n",
+    "print(\"\\nbounding box of the continent:\", tuple(round(b, 2) for b in 
geom.total_bounds))\n",
+    "\n",
+    "summary = africa[[\"NAME\"]].copy()\n",
+    "summary[\"centroid\"] = geom.centroid\n",
+    "summary[\"area_deg2\"] = geom.area\n",
+    "summary[\"hull_area_deg2\"] = geom.convex_hull.area\n",
+    "summary.sort_values(\"area_deg2\", ascending=False).head(5)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Voronoi catchments via `ST_VoronoiPolygons` + `ST_Collect_Agg`\n",
+    "\n",
+    "`GeoSeries.voronoi_polygons()` runs a Voronoi tessellation **per row**, 
which only makes sense if a single row already contains a MultiPoint. To 
compute one Voronoi diagram from many points, drop into SQL: aggregate every 
centroid into a single MultiPoint with the `ST_Collect_Agg` aggregator (new in 
1.8.1), then call `ST_VoronoiPolygons` on the aggregate."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "africa.spark.frame().createOrReplaceTempView(\"africa\")\n",
+    "\n",
+    "voronoi_geom = sedona.sql(\"\"\"\n",
+    "    SELECT ST_VoronoiPolygons(ST_Collect_Agg(ST_Centroid(geometry))) AS 
v\n",
+    "    FROM africa\n",
+    "\"\"\").first()[0]\n",
+    "\n",
+    "stats = sedona.sql(\"\"\"\n",
+    "    SELECT ST_NumGeometries(v)        AS cells,\n",
+    "           ROUND(ST_Area(v), 2)       AS total_area_deg2,\n",
+    "           ROUND(ST_XMin(v), 2)       AS xmin,\n",
+    "           ROUND(ST_YMin(v), 2)       AS ymin,\n",
+    "           ROUND(ST_XMax(v), 2)       AS xmax,\n",
+    "           ROUND(ST_YMax(v), 2)       AS ymax\n",
+    "    FROM (SELECT 
ST_VoronoiPolygons(ST_Collect_Agg(ST_Centroid(geometry))) AS v\n",
+    "          FROM africa)\n",
+    "\"\"\")\n",
+    "stats.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Clip the Voronoi diagram to a continental bounding rectangle\n",
+    "\n",
+    "`clip_by_rect(xmin, ymin, xmax, ymax)` (new in 1.9) is the 
geopandas-style way to crop. We use it to confine the Voronoi cells to a 
generous Africa bbox so they line up cleanly with the country polygons in the 
final plot."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from shapely.wkt import loads as wkt_loads\n",
+    "from shapely.geometry import shape\n",

Review Comment:
   `shape` is imported but never used in this notebook. Removing unused imports 
helps keep the example minimal and avoids suggesting that `shape(...)` is 
needed for the workflow.
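
Unused imports can be detected with the standard-library `ast` module. This is a minimal sketch rather than a replacement for a real linter: `unused_imports` is a hypothetical helper, and it only looks at the source it is given, so for a notebook the code cells would need to be concatenated first (a name imported in one cell may be used in a later one).

```python
import ast


def unused_imports(source: str) -> list[str]:
    """Return imported names that are never referenced in the given source."""
    tree = ast.parse(source)

    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds `a`; `import a.b as c` binds `c`.
                imported.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported.add(alias.asname or alias.name)

    # Every bare identifier in the code, e.g. `plt` in `plt.subplots()`.
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(imported - used)


# Source modelled on the cell under review (simplified, illustrative only).
cell_source = (
    "from shapely.wkt import loads as wkt_loads\n"
    "from shapely.geometry import shape\n"
    "\n"
    "voronoi_shapely = wkt_loads(voronoi_geom.wkt)\n"
)

print(unused_imports(cell_source))
```

Parsing does not import anything, so the check runs without shapely installed.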
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

