Copilot commented on code in PR #2889:
URL: https://github.com/apache/sedona/pull/2889#discussion_r3176218563
##########
docs/usecases/05-geopandas-on-spark.ipynb:
##########
@@ -0,0 +1,280 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<!--\n",
+ " Licensed to the Apache Software Foundation (ASF) under one\n",
+ " or more contributor license agreements. See the NOTICE file\n",
+ " distributed with this work for additional information\n",
+ " regarding copyright ownership. The ASF licenses this file\n",
+ " to you under the Apache License, Version 2.0 (the\n",
+ " \"License\"); you may not use this file except in compliance\n",
+ " with the License. You may obtain a copy of the License at\n",
+ "\n",
+ " http://www.apache.org/licenses/LICENSE-2.0\n",
+ "\n",
+ " Unless required by applicable law or agreed to in writing,\n",
+ " software distributed under the License is distributed on an\n",
+ " \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " KIND, either express or implied. See the License for the\n",
+ " specific language governing permissions and limitations\n",
+ " under the License.\n",
+ "-->\n",
+ "\n",
+ "# Your GeoPandas notebook, scaled with Sedona\n",
+ "\n",
+ "Sedona ships a `sedona.spark.geopandas` package that mirrors the public
GeoPandas API — same constructors, same method names, same return shapes — but
runs on a Spark backend so the same code path scales from a laptop to a
cluster. We answer:\n",
+ "\n",
+ "> **What does it look like to take a typical GeoPandas script and run it
on Sedona?**\n",
+ "\n",
+ "Along the way we use methods that landed in 1.8 / 1.9 — `convex_hull`,
`concave_hull`, `voronoi_polygons`, `clip_by_rect`, `to_crs`, `total_bounds`,
`to_geopandas` — and show how to drop into SQL when GeoPandas's API doesn't
have what you need. Data is the Natural Earth countries shapefile already
shipped with the docker image; no network required."
Review Comment:
The intro lists APIs like `concave_hull`, `to_crs`, and `voronoi_polygons`
as being used, but the notebook doesn't actually call them (it uses
`convex_hull`, `clip_by_rect`, `total_bounds`, `to_geopandas`, and SQL
`ST_VoronoiPolygons`). Either add a minimal example using those APIs or remove
them from this list to avoid misleading readers.
##########
docs/usecases/05-geopandas-on-spark.ipynb:
##########
@@ -0,0 +1,280 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<!--\n",
+ " Licensed to the Apache Software Foundation (ASF) under one\n",
+ " or more contributor license agreements. See the NOTICE file\n",
+ " distributed with this work for additional information\n",
+ " regarding copyright ownership. The ASF licenses this file\n",
+ " to you under the Apache License, Version 2.0 (the\n",
+ " \"License\"); you may not use this file except in compliance\n",
+ " with the License. You may obtain a copy of the License at\n",
+ "\n",
+ " http://www.apache.org/licenses/LICENSE-2.0\n",
+ "\n",
+ " Unless required by applicable law or agreed to in writing,\n",
+ " software distributed under the License is distributed on an\n",
+ " \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " KIND, either express or implied. See the License for the\n",
+ " specific language governing permissions and limitations\n",
+ " under the License.\n",
+ "-->\n",
+ "\n",
+ "# Your GeoPandas notebook, scaled with Sedona\n",
+ "\n",
+ "Sedona ships a `sedona.spark.geopandas` package that mirrors the public
GeoPandas API — same constructors, same method names, same return shapes — but
runs on a Spark backend so the same code path scales from a laptop to a
cluster. We answer:\n",
+ "\n",
+ "> **What does it look like to take a typical GeoPandas script and run it
on Sedona?**\n",
+ "\n",
+ "Along the way we use methods that landed in 1.8 / 1.9 — `convex_hull`,
`concave_hull`, `voronoi_polygons`, `clip_by_rect`, `to_crs`, `total_bounds`,
`to_geopandas` — and show how to drop into SQL when GeoPandas's API doesn't
have what you need. Data is the Natural Earth countries shapefile already
shipped with the docker image; no network required."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 1. Connect to Spark through SedonaContext\n",
+ "\n",
+ "One difference from the other example notebooks: `sedona.spark.geopandas`
runs on top of **pandas-on-Spark** (`pyspark.pandas`), which currently requires
Spark's ANSI mode to be off. We set that flag explicitly when building the
session."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sedona.spark import SedonaContext\n",
+ "\n",
+ "config = (\n",
+ " SedonaContext.builder()\n",
+ " .master(\"spark://localhost:7077\")\n",
+ " .config(\"spark.sql.ansi.enabled\", \"false\")\n",
+ " .getOrCreate()\n",
+ ")\n",
+ "sedona = SedonaContext.create(config)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 2. Load a shapefile with the same `read_file` you already know\n",
+ "\n",
+ "`sedona.spark.geopandas.read_file` is a drop-in for
`geopandas.read_file`. The only twist when pointing at a directory is to
declare the format explicitly — the file extension can't be inferred for
shapefile bundles."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sedona.spark import geopandas as sgpd\n",
+ "\n",
+ "countries = sgpd.read_file(\"data/ne_50m_admin_0_countries_lakes\",
format=\"shapefile\")\n",
+ "print(f\"loaded {len(countries)} countries\")\n",
+ "print(\"columns:\", countries.columns.tolist()[:6], \"…\")\n",
+ "countries[[\"NAME\", \"CONTINENT\", \"POP_EST\"]].head(5)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 3. Filter, then derive — the same idioms as vanilla GeoPandas\n",
+ "\n",
+ "Boolean indexing, `.geometry` accessor, `centroid`, `convex_hull`,
`area`, and `total_bounds` all work exactly as they do in `geopandas`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "africa = countries[countries.CONTINENT == \"Africa\"]\n",
+ "print(f\"{len(africa)} African countries\")\n",
+ "\n",
+ "geom = africa.geometry\n",
+ "print(\"\\nbounding box of the continent:\", tuple(round(b, 2) for b in
geom.total_bounds))\n",
+ "\n",
+ "summary = africa[[\"NAME\"]].copy()\n",
+ "summary[\"centroid\"] = geom.centroid\n",
+ "summary[\"area_deg2\"] = geom.area\n",
+ "summary[\"hull_area_deg2\"] = geom.convex_hull.area\n",
+ "summary.sort_values(\"area_deg2\", ascending=False).head(5)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 4. Voronoi catchments via `ST_VoronoiPolygons` + `ST_Collect_Agg`\n",
+ "\n",
+ "`GeoSeries.voronoi_polygons()` runs a Voronoi tessellation **per row**,
which only makes sense if a single row already contains a MultiPoint. To
compute one Voronoi diagram from many points, drop into SQL: aggregate every
centroid into a single MultiPoint with the `ST_Collect_Agg` aggregator (new in
1.8.1), then call `ST_VoronoiPolygons` on the aggregate."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "africa.spark.frame().createOrReplaceTempView(\"africa\")\n",
+ "\n",
+ "voronoi_geom = sedona.sql(\"\"\"\n",
+ " SELECT ST_VoronoiPolygons(ST_Collect_Agg(ST_Centroid(geometry))) AS
v\n",
+ " FROM africa\n",
+ "\"\"\").first()[0]\n",
+ "\n",
+ "stats = sedona.sql(\"\"\"\n",
+ " SELECT ST_NumGeometries(v) AS cells,\n",
+ " ROUND(ST_Area(v), 2) AS total_area_deg2,\n",
+ " ROUND(ST_XMin(v), 2) AS xmin,\n",
+ " ROUND(ST_YMin(v), 2) AS ymin,\n",
+ " ROUND(ST_XMax(v), 2) AS xmax,\n",
+ " ROUND(ST_YMax(v), 2) AS ymax\n",
+ " FROM (SELECT
ST_VoronoiPolygons(ST_Collect_Agg(ST_Centroid(geometry))) AS v\n",
+ " FROM africa)\n",
+ "\"\"\")\n",
+ "stats.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 5. Clip the Voronoi diagram to a continental bounding rectangle\n",
+ "\n",
+ "`clip_by_rect(xmin, ymin, xmax, ymax)` (new in 1.9) is the
geopandas-style way to crop. We use it to confine the Voronoi cells to a
generous Africa bbox so they line up cleanly with the country polygons in the
final plot."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from shapely.wkt import loads as wkt_loads\n",
+ "from shapely.geometry import shape\n",
+ "\n",
+ "voronoi_shapely = (\n",
+ " wkt_loads(voronoi_geom.wkt) if hasattr(voronoi_geom, \"wkt\") else
voronoi_geom\n",
+ ")\n",
+ "voronoi_cells = sgpd.GeoSeries([g for g in voronoi_shapely.geoms])\n",
+ "print(f\"{len(voronoi_cells)} Voronoi cells before clip\")\n",
+ "\n",
+ "africa_bbox = (-20.0, -36.0, 52.0, 38.0)\n",
+ "clipped = voronoi_cells.clip_by_rect(*africa_bbox)\n",
+ "print(f\"{len(clipped)} Voronoi cells after clip_by_rect\")\n",
Review Comment:
`clip_by_rect` is called here and described as “new in 1.9”. Since these
notebooks ship in the docker image, please either state the required Sedona
version explicitly (e.g., at the top of the notebook) or avoid 1.9-only APIs
unless the docker image is guaranteed to install Sedona >= 1.9. Otherwise this
example will fail when run against an image pinned to an older version.
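One way to make the requirement explicit is a guard cell at the top of the
notebook. The following is a minimal sketch, not part of the PR. It assumes the
Python package is installed under the distribution name `apache-sedona` and
that 1.9.0 is the right floor; the helper names `parse_release`,
`meets_minimum`, and `require_min_version` are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(text):
    """Turn '1.9.0' or '1.9.0.dev0' into a tuple of leading integer parts."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric segment such as 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare release tuples, padding the shorter one with zeros."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def require_min_version(dist="apache-sedona", minimum="1.9.0"):
    """Raise early with a readable message instead of failing mid-notebook."""
    # Assumption: 'apache-sedona' is the installed distribution name.
    try:
        installed = version(dist)
    except PackageNotFoundError:
        raise RuntimeError(f"{dist} is not installed") from None
    if not meets_minimum(installed, minimum):
        raise RuntimeError(
            f"{dist} {installed} found, but this notebook needs >= {minimum}"
        )
    return installed
```

Calling `require_min_version()` in the first code cell would stop the notebook
with a clear message on an older image. Note that this simple parser treats
pre-releases such as `1.9.0rc1` as equal to `1.9.0`; `packaging.version` would
be the stricter choice if that distinction matters.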
##########
docs/usecases/05-geopandas-on-spark.ipynb:
##########
@@ -0,0 +1,280 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<!--\n",
+ " Licensed to the Apache Software Foundation (ASF) under one\n",
+ " or more contributor license agreements. See the NOTICE file\n",
+ " distributed with this work for additional information\n",
+ " regarding copyright ownership. The ASF licenses this file\n",
+ " to you under the Apache License, Version 2.0 (the\n",
+ " \"License\"); you may not use this file except in compliance\n",
+ " with the License. You may obtain a copy of the License at\n",
+ "\n",
+ " http://www.apache.org/licenses/LICENSE-2.0\n",
+ "\n",
+ " Unless required by applicable law or agreed to in writing,\n",
+ " software distributed under the License is distributed on an\n",
+ " \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " KIND, either express or implied. See the License for the\n",
+ " specific language governing permissions and limitations\n",
+ " under the License.\n",
+ "-->\n",
+ "\n",
+ "# Your GeoPandas notebook, scaled with Sedona\n",
+ "\n",
+ "Sedona ships a `sedona.spark.geopandas` package that mirrors the public
GeoPandas API — same constructors, same method names, same return shapes — but
runs on a Spark backend so the same code path scales from a laptop to a
cluster. We answer:\n",
+ "\n",
+ "> **What does it look like to take a typical GeoPandas script and run it
on Sedona?**\n",
+ "\n",
+ "Along the way we use methods that landed in 1.8 / 1.9 — `convex_hull`,
`concave_hull`, `voronoi_polygons`, `clip_by_rect`, `to_crs`, `total_bounds`,
`to_geopandas` — and show how to drop into SQL when GeoPandas's API doesn't
have what you need. Data is the Natural Earth countries shapefile already
shipped with the docker image; no network required."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 1. Connect to Spark through SedonaContext\n",
+ "\n",
+ "One difference from the other example notebooks: `sedona.spark.geopandas`
runs on top of **pandas-on-Spark** (`pyspark.pandas`), which currently requires
Spark's ANSI mode to be off. We set that flag explicitly when building the
session."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sedona.spark import SedonaContext\n",
+ "\n",
+ "config = (\n",
+ " SedonaContext.builder()\n",
+ " .master(\"spark://localhost:7077\")\n",
+ " .config(\"spark.sql.ansi.enabled\", \"false\")\n",
+ " .getOrCreate()\n",
+ ")\n",
+ "sedona = SedonaContext.create(config)"
Review Comment:
The last line in this cell (`sedona = ...`) has leading indentation, which
will raise an `IndentationError` when the cell is executed or the notebook is
run through nbconvert. Align it with `config = ...` so there is no unexpected
indent.
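A check along these lines could catch this class of error before merge. It is a
minimal sketch, not part of the PR: the helper name `find_broken_cells` is
hypothetical, and it assumes the code cells contain plain Python (cells using
IPython magics such as `%matplotlib inline` would show up as false positives).

```python
import ast
import json


def find_broken_cells(notebook_json):
    """Return (cell_index, message) for code cells that fail to parse."""
    nb = json.loads(notebook_json)
    problems = []
    for index, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # Notebook sources are stored either as one string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        try:
            ast.parse(source)
        except SyntaxError as exc:  # IndentationError is a subclass
            problems.append(
                (index, f"{type(exc).__name__}: {exc.msg} (line {exc.lineno})")
            )
    return problems
```

The function only parses each cell and never executes it, so it needs neither
Spark nor Sedona to be installed.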
##########
docs/usecases/05-geopandas-on-spark.ipynb:
##########
@@ -0,0 +1,280 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<!--\n",
+ " Licensed to the Apache Software Foundation (ASF) under one\n",
+ " or more contributor license agreements. See the NOTICE file\n",
+ " distributed with this work for additional information\n",
+ " regarding copyright ownership. The ASF licenses this file\n",
+ " to you under the Apache License, Version 2.0 (the\n",
+ " \"License\"); you may not use this file except in compliance\n",
+ " with the License. You may obtain a copy of the License at\n",
+ "\n",
+ " http://www.apache.org/licenses/LICENSE-2.0\n",
+ "\n",
+ " Unless required by applicable law or agreed to in writing,\n",
+ " software distributed under the License is distributed on an\n",
+ " \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " KIND, either express or implied. See the License for the\n",
+ " specific language governing permissions and limitations\n",
+ " under the License.\n",
+ "-->\n",
+ "\n",
+ "# Your GeoPandas notebook, scaled with Sedona\n",
+ "\n",
+ "Sedona ships a `sedona.spark.geopandas` package that mirrors the public
GeoPandas API — same constructors, same method names, same return shapes — but
runs on a Spark backend so the same code path scales from a laptop to a
cluster. We answer:\n",
+ "\n",
+ "> **What does it look like to take a typical GeoPandas script and run it
on Sedona?**\n",
+ "\n",
+ "Along the way we use methods that landed in 1.8 / 1.9 — `convex_hull`,
`concave_hull`, `voronoi_polygons`, `clip_by_rect`, `to_crs`, `total_bounds`,
`to_geopandas` — and show how to drop into SQL when GeoPandas's API doesn't
have what you need. Data is the Natural Earth countries shapefile already
shipped with the docker image; no network required."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 1. Connect to Spark through SedonaContext\n",
+ "\n",
+ "One difference from the other example notebooks: `sedona.spark.geopandas`
runs on top of **pandas-on-Spark** (`pyspark.pandas`), which currently requires
Spark's ANSI mode to be off. We set that flag explicitly when building the
session."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sedona.spark import SedonaContext\n",
+ "\n",
+ "config = (\n",
+ " SedonaContext.builder()\n",
+ " .master(\"spark://localhost:7077\")\n",
+ " .config(\"spark.sql.ansi.enabled\", \"false\")\n",
+ " .getOrCreate()\n",
+ ")\n",
+ "sedona = SedonaContext.create(config)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 2. Load a shapefile with the same `read_file` you already know\n",
+ "\n",
+ "`sedona.spark.geopandas.read_file` is a drop-in for
`geopandas.read_file`. The only twist when pointing at a directory is to
declare the format explicitly — the file extension can't be inferred for
shapefile bundles."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sedona.spark import geopandas as sgpd\n",
+ "\n",
+ "countries = sgpd.read_file(\"data/ne_50m_admin_0_countries_lakes\",
format=\"shapefile\")\n",
+ "print(f\"loaded {len(countries)} countries\")\n",
+ "print(\"columns:\", countries.columns.tolist()[:6], \"…\")\n",
+ "countries[[\"NAME\", \"CONTINENT\", \"POP_EST\"]].head(5)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 3. Filter, then derive — the same idioms as vanilla GeoPandas\n",
+ "\n",
+ "Boolean indexing, `.geometry` accessor, `centroid`, `convex_hull`,
`area`, and `total_bounds` all work exactly as they do in `geopandas`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "africa = countries[countries.CONTINENT == \"Africa\"]\n",
+ "print(f\"{len(africa)} African countries\")\n",
+ "\n",
+ "geom = africa.geometry\n",
+ "print(\"\\nbounding box of the continent:\", tuple(round(b, 2) for b in
geom.total_bounds))\n",
+ "\n",
+ "summary = africa[[\"NAME\"]].copy()\n",
+ "summary[\"centroid\"] = geom.centroid\n",
+ "summary[\"area_deg2\"] = geom.area\n",
+ "summary[\"hull_area_deg2\"] = geom.convex_hull.area\n",
+ "summary.sort_values(\"area_deg2\", ascending=False).head(5)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 4. Voronoi catchments via `ST_VoronoiPolygons` + `ST_Collect_Agg`\n",
+ "\n",
+ "`GeoSeries.voronoi_polygons()` runs a Voronoi tessellation **per row**,
which only makes sense if a single row already contains a MultiPoint. To
compute one Voronoi diagram from many points, drop into SQL: aggregate every
centroid into a single MultiPoint with the `ST_Collect_Agg` aggregator (new in
1.8.1), then call `ST_VoronoiPolygons` on the aggregate."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "africa.spark.frame().createOrReplaceTempView(\"africa\")\n",
+ "\n",
+ "voronoi_geom = sedona.sql(\"\"\"\n",
+ " SELECT ST_VoronoiPolygons(ST_Collect_Agg(ST_Centroid(geometry))) AS
v\n",
+ " FROM africa\n",
+ "\"\"\").first()[0]\n",
+ "\n",
+ "stats = sedona.sql(\"\"\"\n",
+ " SELECT ST_NumGeometries(v) AS cells,\n",
+ " ROUND(ST_Area(v), 2) AS total_area_deg2,\n",
+ " ROUND(ST_XMin(v), 2) AS xmin,\n",
+ " ROUND(ST_YMin(v), 2) AS ymin,\n",
+ " ROUND(ST_XMax(v), 2) AS xmax,\n",
+ " ROUND(ST_YMax(v), 2) AS ymax\n",
+ " FROM (SELECT
ST_VoronoiPolygons(ST_Collect_Agg(ST_Centroid(geometry))) AS v\n",
+ " FROM africa)\n",
+ "\"\"\")\n",
+ "stats.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 5. Clip the Voronoi diagram to a continental bounding rectangle\n",
+ "\n",
+ "`clip_by_rect(xmin, ymin, xmax, ymax)` (new in 1.9) is the
geopandas-style way to crop. We use it to confine the Voronoi cells to a
generous Africa bbox so they line up cleanly with the country polygons in the
final plot."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from shapely.wkt import loads as wkt_loads\n",
+ "from shapely.geometry import shape\n",
+ "\n",
+ "voronoi_shapely = (\n",
+ " wkt_loads(voronoi_geom.wkt) if hasattr(voronoi_geom, \"wkt\") else
voronoi_geom\n",
+ ")\n",
+ "voronoi_cells = sgpd.GeoSeries([g for g in voronoi_shapely.geoms])\n",
+ "print(f\"{len(voronoi_cells)} Voronoi cells before clip\")\n",
+ "\n",
+ "africa_bbox = (-20.0, -36.0, 52.0, 38.0)\n",
+ "clipped = voronoi_cells.clip_by_rect(*africa_bbox)\n",
+ "print(f\"{len(clipped)} Voronoi cells after clip_by_rect\")\n",
+ "clipped.head(2)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 6. Hand off to vanilla GeoPandas for plotting\n",
+ "\n",
+ "When the data is small enough, `to_geopandas()` materializes a Sedona
GeoDataFrame as a vanilla `geopandas.GeoDataFrame` so it can be plotted with
the standard `.plot(ax=...)` machinery."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "africa_gp = africa.to_geopandas()\n",
+ "voronoi_gp = clipped.to_geopandas()\n",
+ "\n",
+ "fig, ax = plt.subplots(1, 1, figsize=(8, 8))\n",
+ "africa_gp.plot(ax=ax, color=\"#fdf6e3\", edgecolor=\"#586e75\",
linewidth=0.6)\n",
+ "voronoi_gp.boundary.plot(ax=ax, color=\"#dc322f\", linewidth=0.4,
alpha=0.8)\n",
+ "africa_gp.geometry.centroid.plot(ax=ax, color=\"#dc322f\",
markersize=4)\n",
+ "ax.set_title(\"Africa: Voronoi catchments around country centroids\")\n",
+ "ax.set_xlabel(\"longitude\")\n",
+ "ax.set_ylabel(\"latitude\")\n",
+ "ax.set_xlim(africa_bbox[0], africa_bbox[2])\n",
+ "ax.set_ylim(africa_bbox[1], africa_bbox[3])\n",
+ "fig.tight_layout()\n",
+ "fig"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 7. Drop into SQL whenever you need a function the API doesn't
expose\n",
+ "\n",
+ "`<gdf>.spark.frame()` returns the underlying Spark DataFrame, so the
entire `ST_*` SQL catalog is one `createOrReplaceTempView` away. Here we ask
which African capitals are closest to (0°N, 0°E) using `ST_DistanceSpheroid`
(great-circle distance in metres), without leaving the data we already loaded
with the geopandas API."
Review Comment:
This section (and the query below) says “African capitals”, but the notebook
only loads country polygons and then uses `ST_Centroid(geometry)`; the result
is the closest *country centroid*, not a capital. Please reword to “countries”
(or, if you intended capitals, load a capitals dataset / use capital
coordinates if present in attributes).
##########
docs/usecases/05-geopandas-on-spark.ipynb:
##########
@@ -0,0 +1,280 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<!--\n",
+ " Licensed to the Apache Software Foundation (ASF) under one\n",
+ " or more contributor license agreements. See the NOTICE file\n",
+ " distributed with this work for additional information\n",
+ " regarding copyright ownership. The ASF licenses this file\n",
+ " to you under the Apache License, Version 2.0 (the\n",
+ " \"License\"); you may not use this file except in compliance\n",
+ " with the License. You may obtain a copy of the License at\n",
+ "\n",
+ " http://www.apache.org/licenses/LICENSE-2.0\n",
+ "\n",
+ " Unless required by applicable law or agreed to in writing,\n",
+ " software distributed under the License is distributed on an\n",
+ " \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " KIND, either express or implied. See the License for the\n",
+ " specific language governing permissions and limitations\n",
+ " under the License.\n",
+ "-->\n",
+ "\n",
+ "# Your GeoPandas notebook, scaled with Sedona\n",
+ "\n",
+ "Sedona ships a `sedona.spark.geopandas` package that mirrors the public
GeoPandas API — same constructors, same method names, same return shapes — but
runs on a Spark backend so the same code path scales from a laptop to a
cluster. We answer:\n",
+ "\n",
+ "> **What does it look like to take a typical GeoPandas script and run it
on Sedona?**\n",
+ "\n",
+ "Along the way we use methods that landed in 1.8 / 1.9 — `convex_hull`,
`concave_hull`, `voronoi_polygons`, `clip_by_rect`, `to_crs`, `total_bounds`,
`to_geopandas` — and show how to drop into SQL when GeoPandas's API doesn't
have what you need. Data is the Natural Earth countries shapefile already
shipped with the docker image; no network required."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 1. Connect to Spark through SedonaContext\n",
+ "\n",
+ "One difference from the other example notebooks: `sedona.spark.geopandas`
runs on top of **pandas-on-Spark** (`pyspark.pandas`), which currently requires
Spark's ANSI mode to be off. We set that flag explicitly when building the
session."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sedona.spark import SedonaContext\n",
+ "\n",
+ "config = (\n",
+ " SedonaContext.builder()\n",
+ " .master(\"spark://localhost:7077\")\n",
+ " .config(\"spark.sql.ansi.enabled\", \"false\")\n",
+ " .getOrCreate()\n",
+ ")\n",
+ "sedona = SedonaContext.create(config)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 2. Load a shapefile with the same `read_file` you already know\n",
+ "\n",
+ "`sedona.spark.geopandas.read_file` is a drop-in for
`geopandas.read_file`. The only twist when pointing at a directory is to
declare the format explicitly — the file extension can't be inferred for
shapefile bundles."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sedona.spark import geopandas as sgpd\n",
+ "\n",
+ "countries = sgpd.read_file(\"data/ne_50m_admin_0_countries_lakes\",
format=\"shapefile\")\n",
+ "print(f\"loaded {len(countries)} countries\")\n",
+ "print(\"columns:\", countries.columns.tolist()[:6], \"…\")\n",
+ "countries[[\"NAME\", \"CONTINENT\", \"POP_EST\"]].head(5)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 3. Filter, then derive — the same idioms as vanilla GeoPandas\n",
+ "\n",
+ "Boolean indexing, `.geometry` accessor, `centroid`, `convex_hull`,
`area`, and `total_bounds` all work exactly as they do in `geopandas`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "africa = countries[countries.CONTINENT == \"Africa\"]\n",
+ "print(f\"{len(africa)} African countries\")\n",
+ "\n",
+ "geom = africa.geometry\n",
+ "print(\"\\nbounding box of the continent:\", tuple(round(b, 2) for b in
geom.total_bounds))\n",
+ "\n",
+ "summary = africa[[\"NAME\"]].copy()\n",
+ "summary[\"centroid\"] = geom.centroid\n",
+ "summary[\"area_deg2\"] = geom.area\n",
+ "summary[\"hull_area_deg2\"] = geom.convex_hull.area\n",
+ "summary.sort_values(\"area_deg2\", ascending=False).head(5)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 4. Voronoi catchments via `ST_VoronoiPolygons` + `ST_Collect_Agg`\n",
+ "\n",
+ "`GeoSeries.voronoi_polygons()` runs a Voronoi tessellation **per row**,
which only makes sense if a single row already contains a MultiPoint. To
compute one Voronoi diagram from many points, drop into SQL: aggregate every
centroid into a single MultiPoint with the `ST_Collect_Agg` aggregator (new in
1.8.1), then call `ST_VoronoiPolygons` on the aggregate."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "africa.spark.frame().createOrReplaceTempView(\"africa\")\n",
+ "\n",
+ "voronoi_geom = sedona.sql(\"\"\"\n",
+ " SELECT ST_VoronoiPolygons(ST_Collect_Agg(ST_Centroid(geometry))) AS
v\n",
+ " FROM africa\n",
+ "\"\"\").first()[0]\n",
+ "\n",
+ "stats = sedona.sql(\"\"\"\n",
+ " SELECT ST_NumGeometries(v) AS cells,\n",
+ " ROUND(ST_Area(v), 2) AS total_area_deg2,\n",
+ " ROUND(ST_XMin(v), 2) AS xmin,\n",
+ " ROUND(ST_YMin(v), 2) AS ymin,\n",
+ " ROUND(ST_XMax(v), 2) AS xmax,\n",
+ " ROUND(ST_YMax(v), 2) AS ymax\n",
+ " FROM (SELECT
ST_VoronoiPolygons(ST_Collect_Agg(ST_Centroid(geometry))) AS v\n",
+ " FROM africa)\n",
+ "\"\"\")\n",
+ "stats.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 5. Clip the Voronoi diagram to a continental bounding rectangle\n",
+ "\n",
+ "`clip_by_rect(xmin, ymin, xmax, ymax)` (new in 1.9) is the
geopandas-style way to crop. We use it to confine the Voronoi cells to a
generous Africa bbox so they line up cleanly with the country polygons in the
final plot."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from shapely.wkt import loads as wkt_loads\n",
+ "from shapely.geometry import shape\n",
Review Comment:
`shape` is imported but never used in this notebook. Removing unused imports
helps keep the example minimal and avoids suggesting that `shape(...)` is
needed for the workflow.
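For reference, unused imports like this one can be found mechanically. The
sketch below is illustrative only: the helper name `unused_imports` is
hypothetical, and it looks at a single block of source, so in a notebook the
code cells would need to be concatenated first (a name imported in one cell may
be used in a later one). A linter such as pyflakes covers more cases.

```python
import ast


def unused_imports(source):
    """Return the sorted names bound by import statements but never read."""
    tree = ast.parse(source)
    bound = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # `import a.b` binds `a`; `import x as y` binds `y`.
                bound.add(alias.asname or alias.name.split(".")[0])
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(bound - used)


# The two imports from the cell under review, plus one use of `wkt_loads`.
cell = (
    "from shapely.wkt import loads as wkt_loads\n"
    "from shapely.geometry import shape\n"
    "geom = wkt_loads('POINT (0 0)')\n"
)
print(unused_imports(cell))  # ['shape']
```

Because the source is only parsed and never imported, this runs without
shapely installed.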