Copilot commented on code in PR #2954:
URL: https://github.com/apache/sedona/pull/2954#discussion_r3251170193
##########
docs/tutorial/raster.md:
##########
@@ -17,71 +17,43 @@
under the License.
-->
-!!!note
- Sedona uses 1-based indexing for all raster functions except [map algebra
function](../api/sql/Raster-map-algebra.md), which uses 0-based indexing.
+Sedona SQL works with raster data alongside vectors. This tutorial walks a
single dataset through a complete pipeline — load, inspect, visualize, process,
visualize again, save — so you can see what each step produces. Reference
material for additional formats, operators, and Python-side workflows follows
at the end.
!!!note
- Sedona assumes geographic coordinates to be in longitude/latitude order.
If your data is lat/lon order, please use `ST_FlipCoordinates` to swap X and Y.
-
-Starting from `v1.1.0`, Sedona SQL supports raster data sources and raster
operators in DataFrame and SQL. Raster support is available in all Sedona
language bindings including ==Scala, Java, Python, and R==.
-
-This page outlines the steps to manage raster data using SedonaSQL.
-
-=== "Scala"
-
- ```scala
- var myDataFrame = sedona.sql("YOUR_SQL")
- myDataFrame.createOrReplaceTempView("rasterDf")
- ```
-
-=== "Java"
-
- ```java
- Dataset<Row> myDataFrame = sedona.sql("YOUR_SQL")
- myDataFrame.createOrReplaceTempView("rasterDf")
- ```
+ Sedona uses 1-based indexing for all raster functions except [map
algebra](../api/sql/Raster-map-algebra.md), which uses 0-based indexing.
-=== "Python"
-
- ```python
- myDataFrame = sedona.sql("YOUR_SQL")
- myDataFrame.createOrReplaceTempView("rasterDf")
- ```
+!!!note
+ Sedona assumes geographic coordinates are in longitude/latitude order. If
your data is lat/lon, swap axes with `ST_FlipCoordinates`.
-Detailed SedonaSQL APIs are available here: [SedonaSQL
API](../api/sql/Overview.md). You can find example raster data in [Sedona
GitHub
repo](https://github.com/apache/sedona/blob/0eae42576c2588fe278f75cef3b17fee600eac90/spark/common/src/test/resources/raster/raster_with_no_data/test5.tiff).
+Raster support is available in all Sedona language bindings (Scala, Java,
Python, R). Python is the primary language used in the walkthrough;
multi-language tabs appear on the key steps.
## Set up dependencies
=== "Scala/Java"
1. Read [Sedona Maven Central
coordinates](../setup/maven-coordinates.md) and add Sedona dependencies in
build.sbt or pom.xml.
- 2. Add [Apache Spark
core](https://mvnrepository.com/artifact/org.apache.spark/spark-core), [Apache
SparkSQL](https://mvnrepository.com/artifact/org.apache.spark/spark-sql) in
build.sbt or pom.xml.
- 3. Please see [SQL example project](demo.md)
+ 2. Add [Apache Spark
core](https://mvnrepository.com/artifact/org.apache.spark/spark-core) and
[Apache
SparkSQL](https://mvnrepository.com/artifact/org.apache.spark/spark-sql).
+ 3. See the [SQL example project](demo.md).
=== "Python"
- 1. Please read [Quick start](../setup/install-python.md) to install
Sedona Python.
- 2. This tutorial is based on [Sedona SQL Jupyter Notebook
example](jupyter-notebook.md).
-
-## Create Sedona config
+ 1. Follow [Quick start](../setup/install-python.md) to install Sedona
Python.
+ 2. This tutorial mirrors the structure of the [Sedona SQL Jupyter
Notebook example](jupyter-notebook.md).
-Use the following code to create your Sedona config at the beginning. If you
already have a SparkSession (usually named `spark`) created by Wherobots/AWS
EMR/Databricks, please skip this step and use `spark` directly.
+## Create a SedonaContext
-You can add additional Spark runtime config to the config builder. For
example,
`SedonaContext.builder().config("spark.sql.autoBroadcastJoinThreshold",
"10485760")`
+If you already have a SparkSession (Wherobots, AWS EMR, Databricks), skip
ahead and pass it to `SedonaContext.create`. Otherwise:
=== "Scala"
```scala
import org.apache.sedona.spark.SedonaContext
val config = SedonaContext.builder()
- .master("local[*]") // Delete this if run in cluster mode
- .appName("readTestScala") // Change this to a proper name
- .getOrCreate()
- ```
- If you use SedonaViz together with SedonaSQL, please add the following
line after `SedonaContext.builder()` to enable Sedona Kryo serializer:
- ```scala
- .config("spark.kryo.registrator",
classOf[SedonaVizKryoRegistrator].getName) //
org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
+ .master("local[*]")
+ .appName("rasterTutorial")
+ .getOrCreate()
+ val sedona = SedonaContext.create(config)
Review Comment:
The Scala/Java `SedonaContext.builder()` examples hard-code
`.master("local[*]")` without the usual note to remove it in cluster
environments. In other tutorials this line is annotated as local-only; consider
adding the same guidance here to avoid users accidentally overriding their
cluster master.
##########
docs/tutorial/raster.md:
##########
@@ -90,625 +62,618 @@ You can add additional Spark runtime config to the config
builder. For example,
import org.apache.sedona.spark.SedonaContext;
SparkSession config = SedonaContext.builder()
- .master("local[*]") // Delete this if run in cluster mode
- .appName("readTestScala") // Change this to a proper name
- .getOrCreate()
- ```
- If you use SedonaViz together with SedonaSQL, please add the following
line after `SedonaContext.builder()` to enable Sedona Kryo serializer:
- ```scala
- .config("spark.kryo.registrator",
SedonaVizKryoRegistrator.class.getName) //
org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
+ .master("local[*]")
+ .appName("rasterTutorial")
+ .getOrCreate();
+ SparkSession sedona = SedonaContext.create(config);
```
=== "Python"
```python
- from sedona.spark import *
-
- config = SedonaContext.builder() .\
- config('spark.jars.packages',
- 'org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{
sedona.current_version }},'
- 'org.datasyslab:geotools-wrapper:{{ sedona.current_geotools
}}'). \
- getOrCreate()
+ from sedona.spark import SedonaContext
+
+ config = (
+ SedonaContext.builder()
+ .config(
+ "spark.jars.packages",
+ "org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{
sedona.current_version }},"
+ "org.datasyslab:geotools-wrapper:{{ sedona.current_geotools }}",
+ )
+ .getOrCreate()
+ )
+ sedona = SedonaContext.create(config)
```
- Please replace the `3.3` in the package name of sedona-spark-shaded with
the corresponding major.minor version of Spark, such as
`sedona-spark-shaded-3.4_2.12:{{ sedona.current_version }}`.
-
-## Initiate SedonaContext
-
-Add the following line after creating the Sedona config. If you already have a
SparkSession (usually named `spark`) created by Wherobots/AWS EMR/Databricks,
please call `SedonaContext.create(spark)` instead.
+ Replace `3.3` with the major.minor version of your Spark install (for
example `sedona-spark-shaded-3.4_2.12`).
-=== "Scala"
-
- ```scala
- import org.apache.sedona.spark.SedonaContext
+You can also register Sedona by passing `--conf
spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` to
`spark-submit` or `spark-shell`.
- val sedona = SedonaContext.create(config)
- ```
+## End-to-end walkthrough
-=== "Java"
+The walkthrough uses a single 2-band GeoTIFF — red and near-infrared
reflectance over a small AOI — and carries it through every stage of a typical
raster workflow. The scene is synthesized in Python so the example is fully
reproducible and ships no extra bytes. The same SQL runs unchanged against real
Sentinel-2 chips; only the input path changes.
- ```java
- import org.apache.sedona.spark.SedonaContext;
+
- SparkSession sedona = SedonaContext.create(config)
- ```
+??? example "What real rasters look like"
-=== "Python"
+ The same code paths handle anything the GeoTIFF spec supports. Two
examples from Sedona's own test resources:
- ```python
- from sedona.spark import *
+ | 3-band color raster | Single-band raster |
+ | :--- | :--- |
+ |  |
 |
- sedona = SedonaContext.create(config)
- ```
+ `RS_NumBands(rast)` would return `3` and `1` respectively. Band-level
functions like `RS_Band(rast, ARRAY(1,2,3))` and `RS_MapAlgebra` work the same
way on both.
-You can also register everything by passing `--conf
spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` to
`spark-submit` or `spark-shell`.
+### 1. Create the input scene
-## Load GeoTiff data
+Synthesize a 256 × 256 raster with a circular vegetated field. Real workflows
skip this step and point Sedona at existing GeoTIFFs on disk or in object
storage.
-The recommended way to load GeoTiff raster data is the `raster` data source.
It loads GeoTiff files and automatically splits them into smaller tiles. Each
tile becomes a row in the resulting DataFrame stored in `Raster` format.
+```python
+import os
+import numpy as np
+import rasterio
+from rasterio.transform import from_bounds
+
Review Comment:
Step 1 imports `numpy` and `rasterio`, but the dependency section doesn't
mention installing these Python packages. Add a short note (e.g., `pip install
numpy rasterio`) so readers don't hit `ModuleNotFoundError` when running the
walkthrough.
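
If a pre-flight check is preferred over a bare install line, something like the sketch below could accompany the note. The helper name `missing_packages` is hypothetical and not part of this PR; it assumes the import names match the pip package names, which holds for `numpy` and `rasterio`.

```python
import importlib.util

# Top-level modules imported by the step 1 snippet in this PR.
REQUIRED = ("numpy", "rasterio")


def missing_packages(names=REQUIRED):
    """Return the module names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    # Import name and pip name are assumed to be identical here.
    print("Missing packages; install with: pip install " + " ".join(missing))
else:
    print("numpy and rasterio are available")
```

Checking with `find_spec` avoids actually importing the packages, so the check stays cheap and reports every missing package at once.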
##########
docs/tutorial/raster.md:
##########
@@ -90,625 +62,618 @@ You can add additional Spark runtime config to the config
builder. For example,
import org.apache.sedona.spark.SedonaContext;
SparkSession config = SedonaContext.builder()
- .master("local[*]") // Delete this if run in cluster mode
- .appName("readTestScala") // Change this to a proper name
- .getOrCreate()
- ```
- If you use SedonaViz together with SedonaSQL, please add the following
line after `SedonaContext.builder()` to enable Sedona Kryo serializer:
- ```scala
- .config("spark.kryo.registrator",
SedonaVizKryoRegistrator.class.getName) //
org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
+ .master("local[*]")
+ .appName("rasterTutorial")
+ .getOrCreate();
+ SparkSession sedona = SedonaContext.create(config);
```
=== "Python"
```python
- from sedona.spark import *
-
- config = SedonaContext.builder() .\
- config('spark.jars.packages',
- 'org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{
sedona.current_version }},'
- 'org.datasyslab:geotools-wrapper:{{ sedona.current_geotools
}}'). \
- getOrCreate()
+ from sedona.spark import SedonaContext
+
+ config = (
+ SedonaContext.builder()
+ .config(
+ "spark.jars.packages",
+ "org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{
sedona.current_version }},"
+ "org.datasyslab:geotools-wrapper:{{ sedona.current_geotools }}",
+ )
+ .getOrCreate()
+ )
+ sedona = SedonaContext.create(config)
```
- Please replace the `3.3` in the package name of sedona-spark-shaded with
the corresponding major.minor version of Spark, such as
`sedona-spark-shaded-3.4_2.12:{{ sedona.current_version }}`.
-
-## Initiate SedonaContext
-
-Add the following line after creating the Sedona config. If you already have a
SparkSession (usually named `spark`) created by Wherobots/AWS EMR/Databricks,
please call `SedonaContext.create(spark)` instead.
+ Replace `3.3` with the major.minor version of your Spark install (for
example `sedona-spark-shaded-3.4_2.12`).
-=== "Scala"
-
- ```scala
- import org.apache.sedona.spark.SedonaContext
+You can also register Sedona by passing `--conf
spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` to
`spark-submit` or `spark-shell`.
- val sedona = SedonaContext.create(config)
- ```
+## End-to-end walkthrough
-=== "Java"
+The walkthrough uses a single 2-band GeoTIFF — red and near-infrared
reflectance over a small AOI — and carries it through every stage of a typical
raster workflow. The scene is synthesized in Python so the example is fully
reproducible and ships no extra bytes. The same SQL runs unchanged against real
Sentinel-2 chips; only the input path changes.
- ```java
- import org.apache.sedona.spark.SedonaContext;
+
- SparkSession sedona = SedonaContext.create(config)
- ```
+??? example "What real rasters look like"
-=== "Python"
+ The same code paths handle anything the GeoTIFF spec supports. Two
examples from Sedona's own test resources:
- ```python
- from sedona.spark import *
+ | 3-band color raster | Single-band raster |
+ | :--- | :--- |
+ |  |
 |
- sedona = SedonaContext.create(config)
- ```
+ `RS_NumBands(rast)` would return `3` and `1` respectively. Band-level
functions like `RS_Band(rast, ARRAY(1,2,3))` and `RS_MapAlgebra` work the same
way on both.
-You can also register everything by passing `--conf
spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` to
`spark-submit` or `spark-shell`.
+### 1. Create the input scene
-## Load GeoTiff data
+Synthesize a 256 × 256 raster with a circular vegetated field. Real workflows
skip this step and point Sedona at existing GeoTIFFs on disk or in object
storage.
-The recommended way to load GeoTiff raster data is the `raster` data source.
It loads GeoTiff files and automatically splits them into smaller tiles. Each
tile becomes a row in the resulting DataFrame stored in `Raster` format.
+```python
+import os
+import numpy as np
+import rasterio
+from rasterio.transform import from_bounds
+
+WORK = "/tmp/sedona-raster-tutorial"
+os.makedirs(WORK, exist_ok=True)
+
+AOI = (-91.10, 41.50, -91.00, 41.60) # xmin, ymin, xmax, ymax in EPSG:4326
+W = H = 256
+transform = from_bounds(*AOI, W, H)
+rng = np.random.default_rng(42)
+
+ys, xs = np.mgrid[0:H, 0:W]
+field = ((xs - 96) ** 2 + (ys - 160) ** 2) < 60**2 # circular vegetated field
+
+red = (1500 + 200 * rng.standard_normal((H, W))).clip(0,
10000).astype("uint16")
+nir = (1800 + 200 * rng.standard_normal((H, W))).clip(0, 10000)
+nir = np.where(field, nir + 4000, nir).astype("uint16")
+
+with rasterio.open(
+ f"{WORK}/scene.tif",
+ "w",
+ driver="GTiff",
+ tiled=True,
+ blockxsize=256,
+ blockysize=256,
+ height=H,
+ width=W,
+ count=2,
+ dtype="uint16",
+ crs="EPSG:4326",
+ transform=transform,
+) as dst:
+ dst.write(red, 1)
+ dst.set_band_description(1, "red")
+ dst.write(nir, 2)
+ dst.set_band_description(2, "nir")
+```
+
+### 2. Load with the `raster` data source
+
+The `raster` data source loads GeoTIFFs and automatically splits each file
into tiles. Every tile becomes a row in a DataFrame with a `Raster`-typed
column.
=== "Scala"
- ```scala
- var rasterDf = sedona.read.format("raster").load("/some/path/*.tif")
- rasterDf.createOrReplaceTempView("rasterDf")
- rasterDf.show()
- ```
+
+ ```scala
+ val rasterDf = sedona.read.format("raster").load(s"$WORK/scene.tif")
+ rasterDf.createOrReplaceTempView("rasterDf")
+ rasterDf.show()
+ ```
=== "Java"
- ```java
- Dataset<Row> rasterDf =
sedona.read().format("raster").load("/some/path/*.tif");
- rasterDf.createOrReplaceTempView("rasterDf");
- rasterDf.show();
- ```
+
+ ```java
Review Comment:
The Java snippet uses `WORK + "/scene.tif"`, but `WORK` is only defined in
the Python snippet. Define `WORK` in the Java example (e.g., `String WORK =
...`) or use the literal path (`/tmp/sedona-raster-tutorial/scene.tif`) so the
snippet is self-contained.
##########
docs/tutorial/raster.md:
##########
@@ -90,625 +62,618 @@ You can add additional Spark runtime config to the config
builder. For example,
import org.apache.sedona.spark.SedonaContext;
SparkSession config = SedonaContext.builder()
- .master("local[*]") // Delete this if run in cluster mode
- .appName("readTestScala") // Change this to a proper name
- .getOrCreate()
- ```
- If you use SedonaViz together with SedonaSQL, please add the following
line after `SedonaContext.builder()` to enable Sedona Kryo serializer:
- ```scala
- .config("spark.kryo.registrator",
SedonaVizKryoRegistrator.class.getName) //
org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
+ .master("local[*]")
+ .appName("rasterTutorial")
+ .getOrCreate();
+ SparkSession sedona = SedonaContext.create(config);
```
=== "Python"
```python
- from sedona.spark import *
-
- config = SedonaContext.builder() .\
- config('spark.jars.packages',
- 'org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{
sedona.current_version }},'
- 'org.datasyslab:geotools-wrapper:{{ sedona.current_geotools
}}'). \
- getOrCreate()
+ from sedona.spark import SedonaContext
+
+ config = (
+ SedonaContext.builder()
+ .config(
+ "spark.jars.packages",
+ "org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{
sedona.current_version }},"
+ "org.datasyslab:geotools-wrapper:{{ sedona.current_geotools }}",
+ )
+ .getOrCreate()
+ )
+ sedona = SedonaContext.create(config)
```
- Please replace the `3.3` in the package name of sedona-spark-shaded with
the corresponding major.minor version of Spark, such as
`sedona-spark-shaded-3.4_2.12:{{ sedona.current_version }}`.
-
-## Initiate SedonaContext
-
-Add the following line after creating the Sedona config. If you already have a
SparkSession (usually named `spark`) created by Wherobots/AWS EMR/Databricks,
please call `SedonaContext.create(spark)` instead.
+ Replace `3.3` with the major.minor version of your Spark install (for
example `sedona-spark-shaded-3.4_2.12`).
-=== "Scala"
-
- ```scala
- import org.apache.sedona.spark.SedonaContext
+You can also register Sedona by passing `--conf
spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` to
`spark-submit` or `spark-shell`.
- val sedona = SedonaContext.create(config)
- ```
+## End-to-end walkthrough
-=== "Java"
+The walkthrough uses a single 2-band GeoTIFF — red and near-infrared
reflectance over a small AOI — and carries it through every stage of a typical
raster workflow. The scene is synthesized in Python so the example is fully
reproducible and ships no extra bytes. The same SQL runs unchanged against real
Sentinel-2 chips; only the input path changes.
- ```java
- import org.apache.sedona.spark.SedonaContext;
+
- SparkSession sedona = SedonaContext.create(config)
- ```
+??? example "What real rasters look like"
-=== "Python"
+ The same code paths handle anything the GeoTIFF spec supports. Two
examples from Sedona's own test resources:
- ```python
- from sedona.spark import *
+ | 3-band color raster | Single-band raster |
+ | :--- | :--- |
+ |  |
 |
- sedona = SedonaContext.create(config)
- ```
+ `RS_NumBands(rast)` would return `3` and `1` respectively. Band-level
functions like `RS_Band(rast, ARRAY(1,2,3))` and `RS_MapAlgebra` work the same
way on both.
-You can also register everything by passing `--conf
spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` to
`spark-submit` or `spark-shell`.
+### 1. Create the input scene
-## Load GeoTiff data
+Synthesize a 256 × 256 raster with a circular vegetated field. Real workflows
skip this step and point Sedona at existing GeoTIFFs on disk or in object
storage.
-The recommended way to load GeoTiff raster data is the `raster` data source.
It loads GeoTiff files and automatically splits them into smaller tiles. Each
tile becomes a row in the resulting DataFrame stored in `Raster` format.
+```python
+import os
+import numpy as np
+import rasterio
+from rasterio.transform import from_bounds
+
+WORK = "/tmp/sedona-raster-tutorial"
+os.makedirs(WORK, exist_ok=True)
+
+AOI = (-91.10, 41.50, -91.00, 41.60) # xmin, ymin, xmax, ymax in EPSG:4326
+W = H = 256
+transform = from_bounds(*AOI, W, H)
+rng = np.random.default_rng(42)
+
+ys, xs = np.mgrid[0:H, 0:W]
+field = ((xs - 96) ** 2 + (ys - 160) ** 2) < 60**2 # circular vegetated field
+
+red = (1500 + 200 * rng.standard_normal((H, W))).clip(0,
10000).astype("uint16")
+nir = (1800 + 200 * rng.standard_normal((H, W))).clip(0, 10000)
+nir = np.where(field, nir + 4000, nir).astype("uint16")
+
+with rasterio.open(
+ f"{WORK}/scene.tif",
+ "w",
+ driver="GTiff",
+ tiled=True,
+ blockxsize=256,
+ blockysize=256,
+ height=H,
+ width=W,
+ count=2,
+ dtype="uint16",
+ crs="EPSG:4326",
+ transform=transform,
+) as dst:
+ dst.write(red, 1)
+ dst.set_band_description(1, "red")
+ dst.write(nir, 2)
+ dst.set_band_description(2, "nir")
+```
+
+### 2. Load with the `raster` data source
+
+The `raster` data source loads GeoTIFFs and automatically splits each file
into tiles. Every tile becomes a row in a DataFrame with a `Raster`-typed
column.
=== "Scala"
- ```scala
- var rasterDf = sedona.read.format("raster").load("/some/path/*.tif")
- rasterDf.createOrReplaceTempView("rasterDf")
- rasterDf.show()
- ```
+
+ ```scala
+ val rasterDf = sedona.read.format("raster").load(s"$WORK/scene.tif")
+ rasterDf.createOrReplaceTempView("rasterDf")
+ rasterDf.show()
Review Comment:
The Scala snippet uses `s"$WORK/scene.tif"`, but `WORK` is only defined in
the Python synthesis snippet. As written, the Scala example won’t compile/run
unless you also define `WORK` in Scala (or replace it with the literal path).
##########
docs/tutorial/raster.md:
##########
@@ -90,625 +62,618 @@ You can add additional Spark runtime config to the config
builder. For example,
import org.apache.sedona.spark.SedonaContext;
SparkSession config = SedonaContext.builder()
- .master("local[*]") // Delete this if run in cluster mode
- .appName("readTestScala") // Change this to a proper name
- .getOrCreate()
- ```
- If you use SedonaViz together with SedonaSQL, please add the following
line after `SedonaContext.builder()` to enable Sedona Kryo serializer:
- ```scala
- .config("spark.kryo.registrator",
SedonaVizKryoRegistrator.class.getName) //
org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
+ .master("local[*]")
+ .appName("rasterTutorial")
+ .getOrCreate();
+ SparkSession sedona = SedonaContext.create(config);
```
=== "Python"
```python
- from sedona.spark import *
-
- config = SedonaContext.builder() .\
- config('spark.jars.packages',
- 'org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{
sedona.current_version }},'
- 'org.datasyslab:geotools-wrapper:{{ sedona.current_geotools
}}'). \
- getOrCreate()
+ from sedona.spark import SedonaContext
+
+ config = (
+ SedonaContext.builder()
+ .config(
+ "spark.jars.packages",
+ "org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{
sedona.current_version }},"
+ "org.datasyslab:geotools-wrapper:{{ sedona.current_geotools }}",
+ )
+ .getOrCreate()
+ )
+ sedona = SedonaContext.create(config)
```
- Please replace the `3.3` in the package name of sedona-spark-shaded with
the corresponding major.minor version of Spark, such as
`sedona-spark-shaded-3.4_2.12:{{ sedona.current_version }}`.
-
-## Initiate SedonaContext
-
-Add the following line after creating the Sedona config. If you already have a
SparkSession (usually named `spark`) created by Wherobots/AWS EMR/Databricks,
please call `SedonaContext.create(spark)` instead.
+ Replace `3.3` with the major.minor version of your Spark install (for
example `sedona-spark-shaded-3.4_2.12`).
-=== "Scala"
-
- ```scala
- import org.apache.sedona.spark.SedonaContext
+You can also register Sedona by passing `--conf
spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` to
`spark-submit` or `spark-shell`.
- val sedona = SedonaContext.create(config)
- ```
+## End-to-end walkthrough
-=== "Java"
+The walkthrough uses a single 2-band GeoTIFF — red and near-infrared
reflectance over a small AOI — and carries it through every stage of a typical
raster workflow. The scene is synthesized in Python so the example is fully
reproducible and ships no extra bytes. The same SQL runs unchanged against real
Sentinel-2 chips; only the input path changes.
- ```java
- import org.apache.sedona.spark.SedonaContext;
+
- SparkSession sedona = SedonaContext.create(config)
- ```
+??? example "What real rasters look like"
-=== "Python"
+ The same code paths handle anything the GeoTIFF spec supports. Two
examples from Sedona's own test resources:
- ```python
- from sedona.spark import *
+ | 3-band color raster | Single-band raster |
+ | :--- | :--- |
+ |  |
 |
- sedona = SedonaContext.create(config)
- ```
+ `RS_NumBands(rast)` would return `3` and `1` respectively. Band-level
functions like `RS_Band(rast, ARRAY(1,2,3))` and `RS_MapAlgebra` work the same
way on both.
-You can also register everything by passing `--conf
spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` to
`spark-submit` or `spark-shell`.
+### 1. Create the input scene
-## Load GeoTiff data
+Synthesize a 256 × 256 raster with a circular vegetated field. Real workflows
skip this step and point Sedona at existing GeoTIFFs on disk or in object
storage.
-The recommended way to load GeoTiff raster data is the `raster` data source.
It loads GeoTiff files and automatically splits them into smaller tiles. Each
tile becomes a row in the resulting DataFrame stored in `Raster` format.
+```python
+import os
+import numpy as np
+import rasterio
+from rasterio.transform import from_bounds
+
+WORK = "/tmp/sedona-raster-tutorial"
+os.makedirs(WORK, exist_ok=True)
+
+AOI = (-91.10, 41.50, -91.00, 41.60) # xmin, ymin, xmax, ymax in EPSG:4326
+W = H = 256
+transform = from_bounds(*AOI, W, H)
+rng = np.random.default_rng(42)
+
+ys, xs = np.mgrid[0:H, 0:W]
+field = ((xs - 96) ** 2 + (ys - 160) ** 2) < 60**2 # circular vegetated field
+
+red = (1500 + 200 * rng.standard_normal((H, W))).clip(0,
10000).astype("uint16")
+nir = (1800 + 200 * rng.standard_normal((H, W))).clip(0, 10000)
+nir = np.where(field, nir + 4000, nir).astype("uint16")
+
+with rasterio.open(
+ f"{WORK}/scene.tif",
+ "w",
+ driver="GTiff",
+ tiled=True,
+ blockxsize=256,
+ blockysize=256,
+ height=H,
+ width=W,
+ count=2,
+ dtype="uint16",
+ crs="EPSG:4326",
+ transform=transform,
+) as dst:
+ dst.write(red, 1)
+ dst.set_band_description(1, "red")
+ dst.write(nir, 2)
+ dst.set_band_description(2, "nir")
+```
+
+### 2. Load with the `raster` data source
+
+The `raster` data source loads GeoTIFFs and automatically splits each file
into tiles. Every tile becomes a row in a DataFrame with a `Raster`-typed
column.
=== "Scala"
- ```scala
- var rasterDf = sedona.read.format("raster").load("/some/path/*.tif")
- rasterDf.createOrReplaceTempView("rasterDf")
- rasterDf.show()
- ```
+
+ ```scala
+ val rasterDf = sedona.read.format("raster").load(s"$WORK/scene.tif")
+ rasterDf.createOrReplaceTempView("rasterDf")
+ rasterDf.show()
+ ```
=== "Java"
- ```java
- Dataset<Row> rasterDf =
sedona.read().format("raster").load("/some/path/*.tif");
- rasterDf.createOrReplaceTempView("rasterDf");
- rasterDf.show();
- ```
+
+ ```java
+ Dataset<Row> rasterDf = sedona.read().format("raster").load(WORK +
"/scene.tif");
+ rasterDf.createOrReplaceTempView("rasterDf");
+ rasterDf.show();
+ ```
=== "Python"
- ```python
- rasterDf = sedona.read.format("raster").load("/some/path/*.tif")
- rasterDf.createOrReplaceTempView("rasterDf")
- rasterDf.show()
- ```
-The output will look like this:
+ ```python
+ rasterDf = sedona.read.format("raster").load(f"{WORK}/scene.tif")
+ rasterDf.createOrReplaceTempView("rasterDf")
+ rasterDf.show()
+ ```
```
-+--------------------+---+---+----+
-| rast| x| y|name|
-+--------------------+---+---+----+
-|GridCoverage2D["g...| 0| 0| ...|
-|GridCoverage2D["g...| 1| 0| ...|
-|GridCoverage2D["g...| 2| 0| ...|
-...
++--------------------+---+---+----------+
+| rast| x| y| name|
++--------------------+---+---+----------+
+|GridCoverage2D["g...| 0| 0| scene.tif|
++--------------------+---+---+----------+
```
-The output contains the following columns:
+The columns are:
-- `rast`: The raster data in `Raster` format.
-- `x`: The 0-based x-coordinate of the tile. This column is only present when
retile is not disabled.
-- `y`: The 0-based y-coordinate of the tile. This column is only present when
retile is not disabled.
-- `name`: The name of the raster file.
+- `rast` — the raster, in Sedona's `Raster` type.
+- `x`, `y` — the 0-based tile index inside the source file (present when
tiling is enabled).
+- `name` — the source filename.
-### Tiling options
+The 256 × 256 scene fits in a single tile here, so you get one row. A
multi-gigabyte GeoTIFF would yield many rows — the same downstream SQL works in
both cases.
-By default, tiling is enabled (`retile = true`) and the tile size is
determined by the GeoTiff file's internal tiling scheme — you do not need to
specify `tileWidth` or `tileHeight`. It is recommended to use [Cloud Optimized
GeoTIFF (COG)](https://www.cogeo.org/) format for raster data since they
usually organize pixel data as square tiles.
+
-You can optionally override the tile size, or disable tiling entirely:
+See [Loading options](#loading-options) below for tile-size overrides,
recursive directory globs, and non-GeoTIFF formats such as NetCDF and Arc Grid.
-| Option | Default | Description |
-| :--- | :--- | :--- |
-| `retile` | `true` | Whether to enable tiling. Set to `false` to load the
entire raster as a single row. |
-| `tileWidth` | GeoTiff's internal tile width | Optional. Override the width
of each tile in pixels. |
-| `tileHeight` | Same as `tileWidth` if set, otherwise GeoTiff's internal tile
height | Optional. Override the height of each tile in pixels. |
-| `padWithNoData` | `false` | Pad the right and bottom tiles with NODATA
values if they are smaller than the specified tile size. |
-
-To override the tile size:
+### 3. Inspect metadata
-=== "Python"
- ```python
- rasterDf = (
- sedona.read.format("raster")
- .option("tileWidth", "256")
- .option("tileHeight", "256")
- .load("/some/path/*.tif")
- )
- ```
-
-!!!note
- If the internal tiling scheme of raster data is not friendly for tiling,
the `raster` data source will throw an error, and you can disable automatic
tiling using `option("retile", "false")`, or specify the tile size manually to
workaround this issue. A better solution is to translate the raster data into
COG format using `gdal_translate` or other tools.
-
-### Loading raster files from directories
-
-The `raster` data source also works with Spark generic file source options,
such as `option("pathGlobFilter", "*.tif*")` and `option("recursiveFileLookup",
"true")`. For instance, you can load all the `.tif` files recursively in a
directory using:
-
-=== "Python"
- ```python
- rasterDf = (
- sedona.read.format("raster")
- .option("recursiveFileLookup", "true")
- .option("pathGlobFilter", "*.tif*")
- .load(path_to_raster_data_folder)
- )
- ```
+Confirm pixel dimensions, georeference, and CRS before processing:
-!!!tip
- When the loaded path ends with `/`, the `raster` data source will look up
raster files in the directory and all its subdirectories recursively. This is
equivalent to specifying a path without trailing `/` and setting
`option("recursiveFileLookup", "true")`.
+```python
+sedona.sql("""
+ SELECT RS_Width(rast) AS width,
+ RS_Height(rast) AS height,
+ RS_NumBands(rast) AS bands,
+ RS_SRID(rast) AS srid,
+ RS_GeoReference(rast) AS world_file
+ FROM rasterDf
+""").show(truncate=False)
+```
Review Comment:
This metadata example uses `.show(truncate=False)` for `RS_GeoReference`,
which escapes newlines as `\n` (as shown in the sample output). Consider adding
a one-line note or an alternate print/collect snippet so readers understand the
world file is multi-line (see RS_GeoReference docs for an example).
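
One possible shape for that alternate snippet, as a sketch only. The helper `describe_world_file` and the field names are hypothetical, and the six-line order assumes the GDAL-style layout that the RS_GeoReference docs describe as the default. The sample string is derived from the tutorial's AOI (0.1 degrees across 256 pixels) for illustration; it was not captured from a Sedona run.

```python
# Field order of a GDAL-style world file (assumed to be RS_GeoReference's default).
WORLD_FILE_FIELDS = (
    "scale_x",
    "skew_y",
    "skew_x",
    "scale_y",
    "upper_left_x",
    "upper_left_y",
)


def describe_world_file(text):
    """Split a six-line world-file string into a dict of named floats."""
    values = [float(line) for line in text.strip().splitlines()]
    if len(values) != len(WORLD_FILE_FIELDS):
        raise ValueError(f"expected 6 lines, got {len(values)}")
    return dict(zip(WORLD_FILE_FIELDS, values))


# Illustrative value only: derived from the tutorial's AOI, not captured from Sedona.
sample = "0.000390625\n0.0\n0.0\n-0.000390625\n-91.1\n41.6"

print(sample)  # print() renders the real line breaks instead of a literal \n
print(describe_world_file(sample))

# With a live session, the string would come from the DataFrame instead:
# row = sedona.sql("SELECT RS_GeoReference(rast) AS world_file FROM rasterDf").first()
# print(row["world_file"])
```

Collecting a single row and printing it keeps the tutorial's SQL unchanged while making the six-line structure visible to readers.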
##########
docs/tutorial/raster.zh.md:
##########
@@ -90,621 +62,634 @@ SedonaSQL 详细 API 说明请参阅 [SedonaSQL
API](../api/sql/Overview.md)。
import org.apache.sedona.spark.SedonaContext;
SparkSession config = SedonaContext.builder()
- .master("local[*]") // 集群模式下请删除此行
- .appName("readTestScala") // 改成合适的名字
- .getOrCreate()
- ```
- 如果同时使用 SedonaViz 与 SedonaSQL,请在 `SedonaContext.builder()` 之后追加以下行启用
Sedona Kryo 序列化器:
- ```scala
- .config("spark.kryo.registrator",
SedonaVizKryoRegistrator.class.getName) //
org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
+ .master("local[*]")
+ .appName("rasterTutorial")
+ .getOrCreate();
+ SparkSession sedona = SedonaContext.create(config);
```
=== "Python"
```python
- from sedona.spark import *
-
- config = SedonaContext.builder() .\
- config('spark.jars.packages',
- 'org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{
sedona.current_version }},'
- 'org.datasyslab:geotools-wrapper:{{ sedona.current_geotools
}}'). \
- getOrCreate()
+ from sedona.spark import SedonaContext
+
+ config = (
+ SedonaContext.builder()
+ .config(
+ "spark.jars.packages",
+ "org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{
sedona.current_version }},"
+ "org.datasyslab:geotools-wrapper:{{ sedona.current_geotools }}",
+ )
+ .getOrCreate()
+ )
+ sedona = SedonaContext.create(config)
```
- 请将 sedona-spark-shaded 包名中的 `3.3` 替换为对应的 Spark major.minor 版本,例如
`sedona-spark-shaded-3.4_2.12:{{ sedona.current_version }}`。
-
-## 初始化 SedonaContext
-
-在创建 Sedona 配置之后加上以下代码。如果您已经有了由 Wherobots / AWS EMR / Databricks 创建的
SparkSession(通常名为 `spark`),请改为调用 `SedonaContext.create(spark)`。
-
-=== "Scala"
+ 请将 `sedona-spark-shaded-3.3` 中的 `3.3` 替换为对应的 Spark 主.次版本号,例如
`sedona-spark-shaded-3.4_2.12`。
- ```scala
- import org.apache.sedona.spark.SedonaContext
+你也可以通过给 `spark-submit` 或 `spark-shell` 传入 `--conf
spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` 来注册 Sedona。
- val sedona = SedonaContext.create(config)
- ```
+## 端到端教程
-=== "Java"
+本教程贯穿一份 2 波段 GeoTIFF —— 一小块 AOI 上的红光与近红外反射率影像 —— 让它走完一条典型的栅格处理流水线。场景影像在 Python
中合成,因此整个示例可重复运行,仓库无需新增任何二进制数据。对真实的 Sentinel-2 切片,下面同样的 SQL 不用改一行就能跑,仅输入路径需要换一下。
- ```java
- import org.apache.sedona.spark.SedonaContext;
+
- SparkSession sedona = SedonaContext.create(config)
- ```
+??? example "真实栅格长什么样"
-=== "Python"
+ 上面这套代码可处理任何符合 GeoTIFF 规范的数据。下面两个示例来自 Sedona 自带的测试资源:
- ```python
- from sedona.spark import *
+ | 3 波段彩色栅格 | 单波段栅格 |
+ | :--- | :--- |
+ |  |
 |
- sedona = SedonaContext.create(config)
- ```
+ 它们的 `RS_NumBands(rast)` 分别返回 `3` 和 `1`。`RS_Band(rast,
ARRAY(1,2,3))`、`RS_MapAlgebra` 等波段级函数在两类栅格上用法一致。
-也可以通过 `spark-submit` / `spark-shell` 传入 `--conf
spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` 一并完成注册。
+### 1. 准备输入场景
-## 加载 GeoTiff 数据
+合成一份 256 × 256 的栅格,其中包含一块圆形植被田。真实工作流跳过这一步,直接让 Sedona 读取磁盘或对象存储上已有的 GeoTIFF。
-加载 GeoTiff 栅格数据的推荐方式是 `raster` 数据源。它会加载 GeoTiff 文件并自动将其切分为较小的 tile,每个 tile 在结果
DataFrame 中作为一行,并以 `Raster` 类型存储。
+```python
+import os
+import numpy as np
+import rasterio
+from rasterio.transform import from_bounds
+
Review Comment:
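The step 1 example imports `numpy` and `rasterio` directly, but the dependency section never says these Python packages need to be installed. Consider adding a line (for example `pip install numpy rasterio`) so readers don't hit `ModuleNotFoundError` at run time.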
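As a complement to an install note, a short pre-flight check could tell readers up front which packages are missing. The sketch below is only an illustration: the helper name `missing_packages` is hypothetical and not part of Sedona or the tutorial, and it relies solely on the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be imported here."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages imported by the step 1 snippet in the tutorial.
required = ["numpy", "rasterio"]
missing = missing_packages(required)

if missing:
    print("Missing packages; install with: pip install " + " ".join(missing))
else:
    print("All tutorial dependencies are importable.")
```

The check only reports what is missing; it deliberately does not install anything, so it is safe to run in a shared or managed environment.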
##########
docs/tutorial/raster.zh.md:
##########
@@ -17,71 +17,43 @@
under the License.
-->
-!!!note
- Sedona 中所有栅格函数均使用从 1 开始的索引,唯一例外是 [map algebra
函数](../api/sql/Raster-map-algebra.md),它使用从 0 开始的索引。
+Sedona SQL
在矢量之外也原生支持栅格数据。本教程使用同一份数据集,把它从加载、检视、可视化、处理、再次可视化到落盘走完一遍完整的流水线,让你能直接看到每一步产出的结果。其他格式、所有算子的速查、以及
Python 端的额外用法,放在文末的参考章节。
!!!note
- Sedona 假定地理坐标按 lon/lat 顺序排列。如果您的数据是 lat/lon 顺序,请使用 `ST_FlipCoordinates` 交换
X 与 Y。
-
-自 `v1.1.0` 起,Sedona SQL 在 DataFrame 与 SQL
中支持栅格数据源与栅格算子。==Scala、Java、Python、R== 等所有 Sedona 语言绑定都已支持栅格能力。
-
-本页介绍如何使用 SedonaSQL 管理栅格数据。
-
-=== "Scala"
-
- ```scala
- var myDataFrame = sedona.sql("YOUR_SQL")
- myDataFrame.createOrReplaceTempView("rasterDf")
- ```
-
-=== "Java"
-
- ```java
- Dataset<Row> myDataFrame = sedona.sql("YOUR_SQL")
- myDataFrame.createOrReplaceTempView("rasterDf")
- ```
-
-=== "Python"
+ Sedona 中所有栅格函数均使用从 1 开始的索引,唯一例外是 [map
algebra](../api/sql/Raster-map-algebra.md),它使用从 0 开始的索引。
- ```python
- myDataFrame = sedona.sql("YOUR_SQL")
- myDataFrame.createOrReplaceTempView("rasterDf")
- ```
+!!!note
+ Sedona 假定地理坐标按经度/纬度顺序排列。若数据是 lat/lon 顺序,请使用 `ST_FlipCoordinates` 交换 X 与 Y。
-SedonaSQL 详细 API 说明请参阅 [SedonaSQL API](../api/sql/Overview.md)。示例栅格数据可在
[Sedona GitHub
仓库](https://github.com/apache/sedona/blob/0eae42576c2588fe278f75cef3b17fee600eac90/spark/common/src/test/resources/raster/raster_with_no_data/test5.tiff)
中找到。
+Scala、Java、Python、R 等所有 Sedona 语言绑定都已支持栅格能力。本教程以 Python
为主语言进行示例,关键步骤会附上多语言切换标签。
## 配置依赖
=== "Scala/Java"
1. 阅读 [Sedona Maven Central 坐标](../setup/maven-coordinates.md),并在
build.sbt 或 pom.xml 中添加 Sedona 依赖。
- 2. 在 build.sbt 或 pom.xml 中添加 [Apache Spark
core](https://mvnrepository.com/artifact/org.apache.spark/spark-core) 与 [Apache
SparkSQL](https://mvnrepository.com/artifact/org.apache.spark/spark-sql) 依赖。
+ 2. 添加 [Apache Spark
core](https://mvnrepository.com/artifact/org.apache.spark/spark-core) 与 [Apache
SparkSQL](https://mvnrepository.com/artifact/org.apache.spark/spark-sql) 依赖。
3. 参考 [SQL 示例项目](demo.md)。
=== "Python"
- 1. 请阅读 [快速开始](../setup/install-python.md) 安装 Sedona Python。
- 2. 本教程基于 [Sedona SQL Jupyter Notebook 示例](jupyter-notebook.md)。
+ 1. 阅读 [快速开始](../setup/install-python.md) 安装 Sedona Python。
+ 2. 本教程结构参考 [Sedona SQL Jupyter Notebook 示例](jupyter-notebook.md)。
-## 创建 Sedona 配置
+## 创建 SedonaContext
-在程序起始处使用以下代码创建 Sedona 配置。如果您已经有了由 Wherobots / AWS EMR / Databricks 创建的
SparkSession(通常名为 `spark`),可跳过此步骤直接使用 `spark`。
-
-可以在 builder 中追加额外的 Spark 运行时配置,例如
`SedonaContext.builder().config("spark.sql.autoBroadcastJoinThreshold",
"10485760")`。
+若你已经有 SparkSession(例如来自 Wherobots、AWS EMR 或 Databricks),可直接跳过此步并把它传入
`SedonaContext.create`。否则:
=== "Scala"
```scala
import org.apache.sedona.spark.SedonaContext
val config = SedonaContext.builder()
- .master("local[*]") // 集群模式下请删除此行
- .appName("readTestScala") // 改成合适的名字
- .getOrCreate()
- ```
- 如果同时使用 SedonaViz 与 SedonaSQL,请在 `SedonaContext.builder()` 之后追加以下行启用
Sedona Kryo 序列化器:
- ```scala
- .config("spark.kryo.registrator",
classOf[SedonaVizKryoRegistrator].getName) //
org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
+ .master("local[*]")
+ .appName("rasterTutorial")
+ .getOrCreate()
+ val sedona = SedonaContext.create(config)
Review Comment:
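The Scala/Java `SedonaContext.builder()` examples hard-code `.master("local[*]")` but drop the "delete this line in cluster mode" hint that the other tutorials carry. Consider restoring that note so readers don't accidentally override the master setting in a cluster environment.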
##########
docs/tutorial/raster.zh.md:
##########
@@ -90,621 +62,634 @@ SedonaSQL 详细 API 说明请参阅 [SedonaSQL
API](../api/sql/Overview.md)。
import org.apache.sedona.spark.SedonaContext;
SparkSession config = SedonaContext.builder()
- .master("local[*]") // 集群模式下请删除此行
- .appName("readTestScala") // 改成合适的名字
- .getOrCreate()
- ```
- 如果同时使用 SedonaViz 与 SedonaSQL,请在 `SedonaContext.builder()` 之后追加以下行启用
Sedona Kryo 序列化器:
- ```scala
- .config("spark.kryo.registrator",
SedonaVizKryoRegistrator.class.getName) //
org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
+ .master("local[*]")
+ .appName("rasterTutorial")
+ .getOrCreate();
+ SparkSession sedona = SedonaContext.create(config);
```
=== "Python"
```python
- from sedona.spark import *
-
- config = SedonaContext.builder() .\
- config('spark.jars.packages',
- 'org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{
sedona.current_version }},'
- 'org.datasyslab:geotools-wrapper:{{ sedona.current_geotools
}}'). \
- getOrCreate()
+ from sedona.spark import SedonaContext
+
+ config = (
+ SedonaContext.builder()
+ .config(
+ "spark.jars.packages",
+ "org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{
sedona.current_version }},"
+ "org.datasyslab:geotools-wrapper:{{ sedona.current_geotools }}",
+ )
+ .getOrCreate()
+ )
+ sedona = SedonaContext.create(config)
```
- 请将 sedona-spark-shaded 包名中的 `3.3` 替换为对应的 Spark major.minor 版本,例如
`sedona-spark-shaded-3.4_2.12:{{ sedona.current_version }}`。
-
-## 初始化 SedonaContext
-
-在创建 Sedona 配置之后加上以下代码。如果您已经有了由 Wherobots / AWS EMR / Databricks 创建的
SparkSession(通常名为 `spark`),请改为调用 `SedonaContext.create(spark)`。
-
-=== "Scala"
+ 请将 `sedona-spark-shaded-3.3` 中的 `3.3` 替换为对应的 Spark 主.次版本号,例如
`sedona-spark-shaded-3.4_2.12`。
- ```scala
- import org.apache.sedona.spark.SedonaContext
+你也可以通过给 `spark-submit` 或 `spark-shell` 传入 `--conf
spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` 来注册 Sedona。
- val sedona = SedonaContext.create(config)
- ```
+## 端到端教程
-=== "Java"
+本教程贯穿一份 2 波段 GeoTIFF —— 一小块 AOI 上的红光与近红外反射率影像 —— 让它走完一条典型的栅格处理流水线。场景影像在 Python
中合成,因此整个示例可重复运行,仓库无需新增任何二进制数据。对真实的 Sentinel-2 切片,下面同样的 SQL 不用改一行就能跑,仅输入路径需要换一下。
- ```java
- import org.apache.sedona.spark.SedonaContext;
+
- SparkSession sedona = SedonaContext.create(config)
- ```
+??? example "真实栅格长什么样"
-=== "Python"
+ 上面这套代码可处理任何符合 GeoTIFF 规范的数据。下面两个示例来自 Sedona 自带的测试资源:
- ```python
- from sedona.spark import *
+ | 3 波段彩色栅格 | 单波段栅格 |
+ | :--- | :--- |
+ |  |
 |
- sedona = SedonaContext.create(config)
- ```
+ 它们的 `RS_NumBands(rast)` 分别返回 `3` 和 `1`。`RS_Band(rast,
ARRAY(1,2,3))`、`RS_MapAlgebra` 等波段级函数在两类栅格上用法一致。
-也可以通过 `spark-submit` / `spark-shell` 传入 `--conf
spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` 一并完成注册。
+### 1. 准备输入场景
-## 加载 GeoTiff 数据
+合成一份 256 × 256 的栅格,其中包含一块圆形植被田。真实工作流跳过这一步,直接让 Sedona 读取磁盘或对象存储上已有的 GeoTIFF。
-加载 GeoTiff 栅格数据的推荐方式是 `raster` 数据源。它会加载 GeoTiff 文件并自动将其切分为较小的 tile,每个 tile 在结果
DataFrame 中作为一行,并以 `Raster` 类型存储。
+```python
+import os
+import numpy as np
+import rasterio
+from rasterio.transform import from_bounds
+
+WORK = "/tmp/sedona-raster-tutorial"
+os.makedirs(WORK, exist_ok=True)
+
+AOI = (-91.10, 41.50, -91.00, 41.60) # xmin, ymin, xmax, ymax,EPSG:4326
+W = H = 256
+transform = from_bounds(*AOI, W, H)
+rng = np.random.default_rng(42)
+
+ys, xs = np.mgrid[0:H, 0:W]
+field = ((xs - 96) ** 2 + (ys - 160) ** 2) < 60**2 # 圆形植被田
+
+red = (1500 + 200 * rng.standard_normal((H, W))).clip(0, 10000).astype("uint16")
+nir = (1800 + 200 * rng.standard_normal((H, W))).clip(0, 10000)
+nir = np.where(field, nir + 4000, nir).astype("uint16")
+
+with rasterio.open(
+ f"{WORK}/scene.tif",
+ "w",
+ driver="GTiff",
+ tiled=True,
+ blockxsize=256,
+ blockysize=256,
+ height=H,
+ width=W,
+ count=2,
+ dtype="uint16",
+ crs="EPSG:4326",
+ transform=transform,
+) as dst:
+ dst.write(red, 1)
+ dst.set_band_description(1, "red")
+ dst.write(nir, 2)
+ dst.set_band_description(2, "nir")
+```
+
+### 2. 使用 `raster` 数据源加载
+
+`raster` 数据源可以加载 GeoTIFF 并自动将文件切成多个 tile。每一个 tile 在结果 DataFrame 中对应一行,`Raster`
类型保存于其中的一列。
=== "Scala"
- ```scala
- var rasterDf = sedona.read.format("raster").load("/some/path/*.tif")
- rasterDf.createOrReplaceTempView("rasterDf")
- rasterDf.show()
- ```
+
+ ```scala
+ val rasterDf = sedona.read.format("raster").load(s"$WORK/scene.tif")
+ rasterDf.createOrReplaceTempView("rasterDf")
+ rasterDf.show()
+ ```
=== "Java"
- ```java
- Dataset<Row> rasterDf =
sedona.read().format("raster").load("/some/path/*.tif");
- rasterDf.createOrReplaceTempView("rasterDf");
- rasterDf.show();
- ```
+
+ ```java
Review Comment:
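The Java snippet uses `WORK + "/scene.tif"`, but `WORK` is defined only in the Python code. Consider adding `String WORK = ...` to the Java snippet (or using an absolute path directly) so the example is self-contained and runnable.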
##########
docs/tutorial/raster.md:
##########
@@ -90,625 +62,618 @@ You can add additional Spark runtime config to the config
builder. For example,
import org.apache.sedona.spark.SedonaContext;
SparkSession config = SedonaContext.builder()
- .master("local[*]") // Delete this if run in cluster mode
- .appName("readTestScala") // Change this to a proper name
- .getOrCreate()
- ```
- If you use SedonaViz together with SedonaSQL, please add the following
line after `SedonaContext.builder()` to enable Sedona Kryo serializer:
- ```scala
- .config("spark.kryo.registrator",
SedonaVizKryoRegistrator.class.getName) //
org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
+ .master("local[*]")
+ .appName("rasterTutorial")
+ .getOrCreate();
+ SparkSession sedona = SedonaContext.create(config);
```
=== "Python"
```python
- from sedona.spark import *
-
- config = SedonaContext.builder() .\
- config('spark.jars.packages',
- 'org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{
sedona.current_version }},'
- 'org.datasyslab:geotools-wrapper:{{ sedona.current_geotools
}}'). \
- getOrCreate()
+ from sedona.spark import SedonaContext
+
+ config = (
+ SedonaContext.builder()
+ .config(
+ "spark.jars.packages",
+ "org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{
sedona.current_version }},"
+ "org.datasyslab:geotools-wrapper:{{ sedona.current_geotools }}",
+ )
+ .getOrCreate()
+ )
+ sedona = SedonaContext.create(config)
```
- Please replace the `3.3` in the package name of sedona-spark-shaded with
the corresponding major.minor version of Spark, such as
`sedona-spark-shaded-3.4_2.12:{{ sedona.current_version }}`.
-
-## Initiate SedonaContext
-
-Add the following line after creating the Sedona config. If you already have a
SparkSession (usually named `spark`) created by Wherobots/AWS EMR/Databricks,
please call `SedonaContext.create(spark)` instead.
+ Replace `3.3` with the major.minor version of your Spark install (for
example `sedona-spark-shaded-3.4_2.12`).
-=== "Scala"
-
- ```scala
- import org.apache.sedona.spark.SedonaContext
+You can also register Sedona by passing `--conf
spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` to
`spark-submit` or `spark-shell`.
- val sedona = SedonaContext.create(config)
- ```
+## End-to-end walkthrough
-=== "Java"
+The walkthrough uses a single 2-band GeoTIFF — red and near-infrared
reflectance over a small AOI — and carries it through every stage of a typical
raster workflow. The scene is synthesized in Python so the example is fully
reproducible and ships no extra bytes. The same SQL runs unchanged against real
Sentinel-2 chips; only the input path changes.
- ```java
- import org.apache.sedona.spark.SedonaContext;
+
- SparkSession sedona = SedonaContext.create(config)
- ```
+??? example "What real rasters look like"
-=== "Python"
+ The same code paths handle anything the GeoTIFF spec supports. Two
examples from Sedona's own test resources:
- ```python
- from sedona.spark import *
+ | 3-band color raster | Single-band raster |
+ | :--- | :--- |
+ |  |
 |
- sedona = SedonaContext.create(config)
- ```
+ `RS_NumBands(rast)` would return `3` and `1` respectively. Band-level
functions like `RS_Band(rast, ARRAY(1,2,3))` and `RS_MapAlgebra` work the same
way on both.
-You can also register everything by passing `--conf
spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` to
`spark-submit` or `spark-shell`.
+### 1. Create the input scene
-## Load GeoTiff data
+Synthesize a 256 × 256 raster with a circular vegetated field. Real workflows
skip this step and point Sedona at existing GeoTIFFs on disk or in object
storage.
-The recommended way to load GeoTiff raster data is the `raster` data source.
It loads GeoTiff files and automatically splits them into smaller tiles. Each
tile becomes a row in the resulting DataFrame stored in `Raster` format.
+```python
+import os
+import numpy as np
+import rasterio
+from rasterio.transform import from_bounds
+
+WORK = "/tmp/sedona-raster-tutorial"
+os.makedirs(WORK, exist_ok=True)
+
+AOI = (-91.10, 41.50, -91.00, 41.60) # xmin, ymin, xmax, ymax in EPSG:4326
+W = H = 256
+transform = from_bounds(*AOI, W, H)
+rng = np.random.default_rng(42)
+
+ys, xs = np.mgrid[0:H, 0:W]
+field = ((xs - 96) ** 2 + (ys - 160) ** 2) < 60**2 # circular vegetated field
+
+red = (1500 + 200 * rng.standard_normal((H, W))).clip(0, 10000).astype("uint16")
+nir = (1800 + 200 * rng.standard_normal((H, W))).clip(0, 10000)
+nir = np.where(field, nir + 4000, nir).astype("uint16")
+
+with rasterio.open(
+ f"{WORK}/scene.tif",
+ "w",
+ driver="GTiff",
+ tiled=True,
+ blockxsize=256,
+ blockysize=256,
+ height=H,
+ width=W,
+ count=2,
+ dtype="uint16",
+ crs="EPSG:4326",
+ transform=transform,
+) as dst:
+ dst.write(red, 1)
+ dst.set_band_description(1, "red")
+ dst.write(nir, 2)
+ dst.set_band_description(2, "nir")
+```
+
+### 2. Load with the `raster` data source
+
+The `raster` data source loads GeoTIFFs and automatically splits each file
into tiles. Every tile becomes a row in a DataFrame with a `Raster`-typed
column.
=== "Scala"
- ```scala
- var rasterDf = sedona.read.format("raster").load("/some/path/*.tif")
- rasterDf.createOrReplaceTempView("rasterDf")
- rasterDf.show()
- ```
+
+ ```scala
+ val rasterDf = sedona.read.format("raster").load(s"$WORK/scene.tif")
+ rasterDf.createOrReplaceTempView("rasterDf")
+ rasterDf.show()
+ ```
=== "Java"
- ```java
- Dataset<Row> rasterDf =
sedona.read().format("raster").load("/some/path/*.tif");
- rasterDf.createOrReplaceTempView("rasterDf");
- rasterDf.show();
- ```
+
+ ```java
+ Dataset<Row> rasterDf = sedona.read().format("raster").load(WORK + "/scene.tif");
+ rasterDf.createOrReplaceTempView("rasterDf");
+ rasterDf.show();
+ ```
=== "Python"
- ```python
- rasterDf = sedona.read.format("raster").load("/some/path/*.tif")
- rasterDf.createOrReplaceTempView("rasterDf")
- rasterDf.show()
- ```
-The output will look like this:
+ ```python
+ rasterDf = sedona.read.format("raster").load(f"{WORK}/scene.tif")
+ rasterDf.createOrReplaceTempView("rasterDf")
+ rasterDf.show()
+ ```
```
-+--------------------+---+---+----+
-| rast| x| y|name|
-+--------------------+---+---+----+
-|GridCoverage2D["g...| 0| 0| ...|
-|GridCoverage2D["g...| 1| 0| ...|
-|GridCoverage2D["g...| 2| 0| ...|
-...
++--------------------+---+---+----------+
+| rast| x| y| name|
++--------------------+---+---+----------+
+|GridCoverage2D["g...| 0| 0| scene.tif|
++--------------------+---+---+----------+
```
-The output contains the following columns:
+The columns are:
-- `rast`: The raster data in `Raster` format.
-- `x`: The 0-based x-coordinate of the tile. This column is only present when
retile is not disabled.
-- `y`: The 0-based y-coordinate of the tile. This column is only present when
retile is not disabled.
-- `name`: The name of the raster file.
+- `rast` — the raster, in Sedona's `Raster` type.
+- `x`, `y` — the 0-based tile index inside the source file (present when
tiling is enabled).
+- `name` — the source filename.
-### Tiling options
+The 256 × 256 scene fits in a single tile here, so you get one row. A
multi-gigabyte GeoTIFF would yield many rows — the same downstream SQL works in
both cases.
-By default, tiling is enabled (`retile = true`) and the tile size is
determined by the GeoTiff file's internal tiling scheme — you do not need to
specify `tileWidth` or `tileHeight`. It is recommended to use [Cloud Optimized
GeoTIFF (COG)](https://www.cogeo.org/) format for raster data since they
usually organize pixel data as square tiles.
+
-You can optionally override the tile size, or disable tiling entirely:
+See [Loading options](#loading-options) below for tile-size overrides,
recursive directory globs, and non-GeoTIFF formats such as NetCDF and Arc Grid.
-| Option | Default | Description |
-| :--- | :--- | :--- |
-| `retile` | `true` | Whether to enable tiling. Set to `false` to load the
entire raster as a single row. |
-| `tileWidth` | GeoTiff's internal tile width | Optional. Override the width
of each tile in pixels. |
-| `tileHeight` | Same as `tileWidth` if set, otherwise GeoTiff's internal tile
height | Optional. Override the height of each tile in pixels. |
-| `padWithNoData` | `false` | Pad the right and bottom tiles with NODATA
values if they are smaller than the specified tile size. |
-
-To override the tile size:
+### 3. Inspect metadata
-=== "Python"
- ```python
- rasterDf = (
- sedona.read.format("raster")
- .option("tileWidth", "256")
- .option("tileHeight", "256")
- .load("/some/path/*.tif")
- )
- ```
-
-!!!note
- If the internal tiling scheme of raster data is not friendly for tiling,
the `raster` data source will throw an error, and you can disable automatic
tiling using `option("retile", "false")`, or specify the tile size manually to
workaround this issue. A better solution is to translate the raster data into
COG format using `gdal_translate` or other tools.
-
-### Loading raster files from directories
-
-The `raster` data source also works with Spark generic file source options,
such as `option("pathGlobFilter", "*.tif*")` and `option("recursiveFileLookup",
"true")`. For instance, you can load all the `.tif` files recursively in a
directory using:
-
-=== "Python"
- ```python
- rasterDf = (
- sedona.read.format("raster")
- .option("recursiveFileLookup", "true")
- .option("pathGlobFilter", "*.tif*")
- .load(path_to_raster_data_folder)
- )
- ```
+Confirm pixel dimensions, georeference, and CRS before processing:
-!!!tip
- When the loaded path ends with `/`, the `raster` data source will look up
raster files in the directory and all its subdirectories recursively. This is
equivalent to specifying a path without trailing `/` and setting
`option("recursiveFileLookup", "true")`.
+```python
+sedona.sql("""
+ SELECT RS_Width(rast) AS width,
+ RS_Height(rast) AS height,
+ RS_NumBands(rast) AS bands,
+ RS_SRID(rast) AS srid,
+ RS_GeoReference(rast) AS world_file
+ FROM rasterDf
+""").show(truncate=False)
+```
-!!!tip
- After loading rasters, you can quickly visualize them in a Jupyter
notebook using `SedonaUtils.display_image(df)`. It automatically detects raster
columns and renders them as images. See [Raster visualizer
docs](../api/sql/Raster-Functions.md#raster-output) for details.
+```
++-----+------+-----+----+----------------------------------------------------------+
+|width|height|bands|srid|world_file                                                |
++-----+------+-----+----+----------------------------------------------------------+
+|256  |256   |2    |4326|0.000391\n0.000000\n0.000000\n-0.000391\n-91.099805\n41.599805|
++-----+------+-----+----+----------------------------------------------------------+
+```
-## Load non-GeoTiff data (NetCDF, Arc Grid)
+[`RS_MetaData`](../api/sql/Raster-Operators/RS_MetaData.md) returns the same
information as a single array: `[upperLeftX, upperLeftY, width, height, scaleX,
scaleY, skewX, skewY, srid, numBands]`.
-For non-GeoTiff raster formats such as NetCDF or Arc Info ASCII Grid, use the
Spark built-in `binaryFile` data source together with Sedona raster
constructors.
+The georeference fields define an affine transform from pixel space to world
space:
-### Step 1: Load to a binary DataFrame
+
-=== "Scala"
- ```scala
- var rawDf = sedona.read.format("binaryFile").load(path_to_raster_data)
- rawDf.createOrReplaceTempView("rawdf")
- rawDf.show()
- ```
+See [Raster metadata](#raster-metadata-reference) for every accessor and
[`RS_PixelAsPoint`](../api/sql/Pixel-Functions/RS_PixelAsPoint.md) /
[`RS_WorldToRasterCoord`](../api/sql/Raster-Accessors/RS_WorldToRasterCoord.md)
for the runtime conversions.
-=== "Java"
- ```java
- Dataset<Row> rawDf =
sedona.read().format("binaryFile").load(path_to_raster_data);
- rawDf.createOrReplaceTempView("rawdf");
- rawDf.show();
- ```
+### 4. Visualize the raw raster
-=== "Python"
- ```python
- rawDf = sedona.read.format("binaryFile").load(path_to_raster_data)
- rawDf.createOrReplaceTempView("rawdf")
- rawDf.show()
- ```
+Render the two bands so you can see the input before any processing.
`SedonaUtils.display_image` auto-detects raster columns inside a Jupyter
notebook and renders them inline:
-The output will look like this:
+```python
+from sedona.spark import SedonaUtils
-```
-| path| modificationTime|length| content|
-+--------------------+--------------------+------+--------------------+
-|file:/Download/ra...|2023-09-06 16:24:...|174803|[49 49 2A 00 08 0...|
+SedonaUtils.display_image(
+ sedona.sql("SELECT RS_Band(rast, ARRAY(1)) AS rast FROM rasterDf")
+)
+SedonaUtils.display_image(
+ sedona.sql("SELECT RS_Band(rast, ARRAY(2)) AS rast FROM rasterDf")
+)
```
-For multiple raster data files, you can load them recursively:
+Band 1 is the red channel — mostly featureless bare ground. Band 2 (NIR)
lights up over the vegetated field:
-=== "Python"
- ```python
- rawDf = (
- sedona.read.format("binaryFile")
- .option("recursiveFileLookup", "true")
- .option("pathGlobFilter", "*.asc*")
- .load(path_to_raster_data_folder)
- )
- rawDf.createOrReplaceTempView("rawdf")
- rawDf.show()
- ```
+| Band 1 (red) | Band 2 (NIR) |
+| :--- | :--- |
+|  |  |
-### Step 2: Create a Raster type column
+Outside a notebook, use [`RS_AsImage(rast,
width)`](../api/sql/Raster-Output/RS_AsImage.md) to get an HTML `<img>` tag, or
[`RS_AsBase64`](../api/sql/Raster-Output/RS_AsBase64.md) for a Base64 string
that any image viewer can decode.
-All raster operations in SedonaSQL require Raster type objects. Use one of the
following constructors:
+### 5. Process — compute NDVI with map algebra
-#### From GeoTiff
+The Normalized Difference Vegetation Index isolates live vegetation:
-```sql
-SELECT RS_FromGeoTiff(content) AS rast, modificationTime, length, path FROM
rawdf
```
-
-#### From Arc Grid
-
-```sql
-SELECT RS_FromArcInfoAsciiGrid(content) AS rast, modificationTime, length,
path FROM rawdf
+NDVI = (NIR − Red) / (NIR + Red)
```
-#### From NetCDF
-
-See [RS_FromNetCDF](../api/sql/Raster-Constructors/RS_FromNetCDF.md) for
details on loading NetCDF files.
-
-To verify the raster column was created successfully:
+[`RS_MapAlgebra`](../api/sql/Raster-map-algebra.md) runs a per-pixel script
over one or more bands. Output type `'D'` (double) preserves the negative side
of the NDVI range:
```python
-rasterDf.printSchema()
-```
-
-```
-root
- |-- rast: raster (nullable = true)
- |-- modificationTime: timestamp (nullable = true)
- |-- length: long (nullable = true)
- |-- path: string (nullable = true)
+ndviDf = sedona.sql("""
+ SELECT RS_MapAlgebra(
+ rast, 'D',
+ 'out[0] = (rast[1] - rast[0]) / (rast[1] + rast[0] + 1e-6);'
+ ) AS rast
+ FROM rasterDf
+""")
+ndviDf.createOrReplaceTempView("ndviDf")
```
-## Raster's metadata
-
-Sedona has a function to get the metadata for the raster, and also a function
to get the world file of the raster.
+
-### Metadata
+Map algebra is the most general processing primitive — clipping, masking,
thresholding, and arithmetic between bands or between rasters all fit the same
`RS_MapAlgebra(rast, pixelType, script)` (or two-raster) shape. See [Map
algebra](../api/sql/Raster-map-algebra.md) for the script syntax and [Raster
processing](#raster-processing-reference) below for alternatives (`RS_Clip`,
`RS_Resample`, `RS_SetValues`).
-This function will return an array of metadata, it will have all the necessary
information about the raster, Please refer to
[RS_MetaData](../api/sql/Raster-Operators/RS_MetaData.md).
+### 6. Visualize the processed raster
-```sql
-SELECT RS_MetaData(rast) FROM rasterDf
-```
-
-Output for the following function will be:
+The NDVI raster makes the vegetated field obvious: green pixels where NDVI is
high, washed-out red elsewhere.
+```python
+SedonaUtils.display_image(ndviDf)
```
-[-1.3095817809482181E7, 4021262.7487925636, 512.0, 517.0, 72.32861272132695,
-72.32861272132695, 0.0, 0.0, 3857.0, 1.0]
-```
-
-The first two elements of the array represent the real-world geographic
coordinates (like longitude/latitude) of the raster image's top left pixel,
while the next two elements represent the pixel dimensions of the raster.
-### World File
+
-There are two kinds of georeferences, GDAL and ESRI seen in [world
files](https://en.wikipedia.org/wiki/World_file). For more information please
refer to [RS_GeoReference](../api/sql/Raster-Accessors/RS_GeoReference.md).
+### 7. Aggregate to vector zones with zonal stats
-```sql
-SELECT RS_GeoReference(rast, "ESRI") FROM rasterDf
-```
-
-The Output will be as follows:
-
-```
-72.328613
-0.000000
-0.000000
--72.328613
--13095781.645176
-4021226.584486
-```
+NDVI per pixel is rarely the deliverable. The question is usually "which
*area* greened up?" — which farm parcel, census block, or watershed.
[`RS_ZonalStats(raster, zone,
statType)`](../api/sql/Raster-Band-Accessors/RS_ZonalStats.md) is the canonical
raster → vector aggregation: every pixel that falls inside a zone polygon
contributes to the statistic.
-World files are used to georeference and geolocate images by establishing an
image-to-world coordinate transformation that assigns real-world geographic
coordinates to the pixels of the image.
+Real parcel boundaries are irregular — odd-shaped fields, road easements
between them, gaps that aren't part of any zone. Define five hand-drawn parcels
over the AOI:
-## Raster Manipulation
-
-Since `v1.5.0` there have been many additions to manipulate raster data, we
will show you a few example queries.
+```python
+from pyspark.sql import Row
+
+parcels = sedona.createDataFrame(
+ [
+ Row(
+ parcel_id="Orchard",
+ wkt="POLYGON((-91.085 41.515, -91.045 41.510, -91.030 41.530, "
+ "-91.040 41.560, -91.075 41.572, -91.085 41.555, -91.085 41.515))",
+ ),
+ Row(
+ parcel_id="EastFarm",
+ wkt="POLYGON((-91.025 41.512, -91.005 41.512, -91.005 41.572, "
+ "-91.035 41.572, -91.025 41.535, -91.025 41.512))",
+ ),
+ Row(
+ parcel_id="WestFarm",
+ wkt="POLYGON((-91.095 41.520, -91.087 41.520, -91.080 41.555, "
+ "-91.080 41.572, -91.095 41.572, -91.095 41.520))",
+ ),
+ Row(
+ parcel_id="NorthBlock",
+ wkt="POLYGON((-91.095 41.580, -91.005 41.580, -91.005 41.598, "
+ "-91.095 41.598, -91.095 41.580))",
+ ),
+ Row(
+ parcel_id="SouthStrip",
+ wkt="POLYGON((-91.095 41.502, -91.005 41.502, -91.005 41.508, "
+ "-91.095 41.508, -91.095 41.502))",
+ ),
+ ]
+).selectExpr("parcel_id", "ST_GeomFromText(wkt) AS geom")
+
+parcels.createOrReplaceTempView("parcels")
+
+ranked = sedona.sql("""
+ SELECT p.parcel_id,
+ ROUND(RS_ZonalStats(n.rast, p.geom, 'mean'), 4) AS mean_ndvi
+ FROM parcels p, ndviDf n
+ ORDER BY mean_ndvi DESC
+""")
+ranked.show()
+```
+
+```
++----------+---------+
+| parcel_id|mean_ndvi|
++----------+---------+
+| Orchard| 0.4213|
+| WestFarm| 0.1182|
+|NorthBlock| 0.0925|
+| EastFarm| 0.0907|
+|SouthStrip| 0.0905|
++----------+---------+
+```
+
+The irregular **Orchard** parcel wins decisively — that's the polygon that
overlaps the vegetated field. Pixels in the gaps between parcels (roads,
untracked land) contribute to no zone and don't affect any statistic.
+
+
!!!note
- Read [SedonaSQL Raster
operators](../api/sql/Raster-Functions.md#raster-operators) to learn how you
can use Sedona for raster manipulation.
+ With a tiled input, the `parcels × ndviDf` cross-join produces one row per
`(parcel, tile)`. To aggregate properly across tiles, compute per-tile `sum`
and `count` and roll them up: `SUM(sum) / SUM(count) GROUP BY parcel_id`. Same
idiom, one extra aggregation.
[`RS_ZonalStatsAll`](../api/sql/Raster-Band-Accessors/RS_ZonalStatsAll.md)
returns every standard statistic in a single call.
-### Coordinate translation
+### 8. Save back to disk
-Sedona allows you to translate coordinates as per your needs. It can translate
pixel locations to world coordinates and vice versa.
+Writing is a two-step pattern: convert the `Raster` column to a binary format
with `RS_AsXXX`, then hand the binary DataFrame to Sedona's `raster` writer.
-#### PixelAsPoint
+
-Use [RS_PixelAsPoint](../api/sql/Pixel-Functions/RS_PixelAsPoint.md) to
translate pixel coordinates to world location.
+=== "Scala"
-```sql
-SELECT RS_PixelAsPoint(rast, 450, 400) FROM rasterDf
-```
+ ```scala
+ import org.apache.spark.sql.functions.expr
Review Comment:
The Scala write example uses `s"$WORK/ndvi_out"`, but `WORK` is only defined
in the Python snippet. Define `WORK` in Scala (or use the literal output path)
so this snippet is runnable as-is.
##########
docs/tutorial/raster.zh.md:
##########
@@ -90,621 +62,634 @@ SedonaSQL 详细 API 说明请参阅 [SedonaSQL
API](../api/sql/Overview.md)。
import org.apache.sedona.spark.SedonaContext;
SparkSession config = SedonaContext.builder()
- .master("local[*]") // 集群模式下请删除此行
- .appName("readTestScala") // 改成合适的名字
- .getOrCreate()
- ```
- 如果同时使用 SedonaViz 与 SedonaSQL,请在 `SedonaContext.builder()` 之后追加以下行启用
Sedona Kryo 序列化器:
- ```scala
- .config("spark.kryo.registrator",
SedonaVizKryoRegistrator.class.getName) //
org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
+ .master("local[*]")
+ .appName("rasterTutorial")
+ .getOrCreate();
+ SparkSession sedona = SedonaContext.create(config);
```
=== "Python"
```python
- from sedona.spark import *
-
- config = SedonaContext.builder() .\
- config('spark.jars.packages',
- 'org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{
sedona.current_version }},'
- 'org.datasyslab:geotools-wrapper:{{ sedona.current_geotools
}}'). \
- getOrCreate()
+ from sedona.spark import SedonaContext
+
+ config = (
+ SedonaContext.builder()
+ .config(
+ "spark.jars.packages",
+ "org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{
sedona.current_version }},"
+ "org.datasyslab:geotools-wrapper:{{ sedona.current_geotools }}",
+ )
+ .getOrCreate()
+ )
+ sedona = SedonaContext.create(config)
```
- 请将 sedona-spark-shaded 包名中的 `3.3` 替换为对应的 Spark major.minor 版本,例如
`sedona-spark-shaded-3.4_2.12:{{ sedona.current_version }}`。
-
-## 初始化 SedonaContext
-
-在创建 Sedona 配置之后加上以下代码。如果您已经有了由 Wherobots / AWS EMR / Databricks 创建的
SparkSession(通常名为 `spark`),请改为调用 `SedonaContext.create(spark)`。
-
-=== "Scala"
+ 请将 `sedona-spark-shaded-3.3` 中的 `3.3` 替换为对应的 Spark 主.次版本号,例如
`sedona-spark-shaded-3.4_2.12`。
- ```scala
- import org.apache.sedona.spark.SedonaContext
+你也可以通过给 `spark-submit` 或 `spark-shell` 传入 `--conf
spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` 来注册 Sedona。
- val sedona = SedonaContext.create(config)
- ```
+## 端到端教程
-=== "Java"
+本教程贯穿一份 2 波段 GeoTIFF —— 一小块 AOI 上的红光与近红外反射率影像 —— 让它走完一条典型的栅格处理流水线。场景影像在 Python
中合成,因此整个示例可重复运行,仓库无需新增任何二进制数据。对真实的 Sentinel-2 切片,下面同样的 SQL 不用改一行就能跑,仅输入路径需要换一下。
- ```java
- import org.apache.sedona.spark.SedonaContext;
+
- SparkSession sedona = SedonaContext.create(config)
- ```
+??? example "真实栅格长什么样"
-=== "Python"
+ 上面这套代码可处理任何符合 GeoTIFF 规范的数据。下面两个示例来自 Sedona 自带的测试资源:
- ```python
- from sedona.spark import *
+ | 3 波段彩色栅格 | 单波段栅格 |
+ | :--- | :--- |
+ |  |
 |
- sedona = SedonaContext.create(config)
- ```
+ 它们的 `RS_NumBands(rast)` 分别返回 `3` 和 `1`。`RS_Band(rast,
ARRAY(1,2,3))`、`RS_MapAlgebra` 等波段级函数在两类栅格上用法一致。
-也可以通过 `spark-submit` / `spark-shell` 传入 `--conf
spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` 一并完成注册。
+### 1. 准备输入场景
-## 加载 GeoTiff 数据
+合成一份 256 × 256 的栅格,其中包含一块圆形植被田。真实工作流跳过这一步,直接让 Sedona 读取磁盘或对象存储上已有的 GeoTIFF。
-加载 GeoTiff 栅格数据的推荐方式是 `raster` 数据源。它会加载 GeoTiff 文件并自动将其切分为较小的 tile,每个 tile 在结果
DataFrame 中作为一行,并以 `Raster` 类型存储。
+```python
+import os
+import numpy as np
+import rasterio
+from rasterio.transform import from_bounds
+
+WORK = "/tmp/sedona-raster-tutorial"
+os.makedirs(WORK, exist_ok=True)
+
+AOI = (-91.10, 41.50, -91.00, 41.60) # xmin, ymin, xmax, ymax,EPSG:4326
+W = H = 256
+transform = from_bounds(*AOI, W, H)
+rng = np.random.default_rng(42)
+
+ys, xs = np.mgrid[0:H, 0:W]
+field = ((xs - 96) ** 2 + (ys - 160) ** 2) < 60**2 # 圆形植被田
+
+red = (1500 + 200 * rng.standard_normal((H, W))).clip(0,
10000).astype("uint16")
+nir = (1800 + 200 * rng.standard_normal((H, W))).clip(0, 10000)
+nir = np.where(field, nir + 4000, nir).astype("uint16")
+
+with rasterio.open(
+ f"{WORK}/scene.tif",
+ "w",
+ driver="GTiff",
+ tiled=True,
+ blockxsize=256,
+ blockysize=256,
+ height=H,
+ width=W,
+ count=2,
+ dtype="uint16",
+ crs="EPSG:4326",
+ transform=transform,
+) as dst:
+ dst.write(red, 1)
+ dst.set_band_description(1, "red")
+ dst.write(nir, 2)
+ dst.set_band_description(2, "nir")
+```
+
+### 2. 使用 `raster` 数据源加载
+
+`raster` 数据源可以加载 GeoTIFF 并自动将文件切成多个 tile。每一个 tile 在结果 DataFrame 中对应一行,`Raster`
类型保存于其中的一列。
=== "Scala"
- ```scala
- var rasterDf = sedona.read.format("raster").load("/some/path/*.tif")
- rasterDf.createOrReplaceTempView("rasterDf")
- rasterDf.show()
- ```
+
+ ```scala
+ val rasterDf = sedona.read.format("raster").load(s"$WORK/scene.tif")
+ rasterDf.createOrReplaceTempView("rasterDf")
+ rasterDf.show()
Review Comment:
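The Scala snippet uses `s"$WORK/scene.tif"`, but `WORK` is only defined in the
earlier Python synthesis script. As written, the Scala snippet cannot run
as-is; consider defining `WORK` in the Scala snippet as well (or using an
absolute path directly).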
##########
docs/tutorial/raster.zh.md:
##########
@@ -90,621 +62,634 @@ SedonaSQL 详细 API 说明请参阅 [SedonaSQL
API](../api/sql/Overview.md)。
import org.apache.sedona.spark.SedonaContext;
SparkSession config = SedonaContext.builder()
- .master("local[*]") // 集群模式下请删除此行
- .appName("readTestScala") // 改成合适的名字
- .getOrCreate()
- ```
- 如果同时使用 SedonaViz 与 SedonaSQL,请在 `SedonaContext.builder()` 之后追加以下行启用
Sedona Kryo 序列化器:
- ```scala
- .config("spark.kryo.registrator",
SedonaVizKryoRegistrator.class.getName) //
org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
+ .master("local[*]")
+ .appName("rasterTutorial")
+ .getOrCreate();
+ SparkSession sedona = SedonaContext.create(config);
```
=== "Python"
```python
- from sedona.spark import *
-
- config = SedonaContext.builder() .\
- config('spark.jars.packages',
- 'org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{
sedona.current_version }},'
- 'org.datasyslab:geotools-wrapper:{{ sedona.current_geotools
}}'). \
- getOrCreate()
+ from sedona.spark import SedonaContext
+
+ config = (
+ SedonaContext.builder()
+ .config(
+ "spark.jars.packages",
+ "org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{
sedona.current_version }},"
+ "org.datasyslab:geotools-wrapper:{{ sedona.current_geotools }}",
+ )
+ .getOrCreate()
+ )
+ sedona = SedonaContext.create(config)
```
- 请将 sedona-spark-shaded 包名中的 `3.3` 替换为对应的 Spark major.minor 版本,例如
`sedona-spark-shaded-3.4_2.12:{{ sedona.current_version }}`。
-
-## 初始化 SedonaContext
-
-在创建 Sedona 配置之后加上以下代码。如果您已经有了由 Wherobots / AWS EMR / Databricks 创建的
SparkSession(通常名为 `spark`),请改为调用 `SedonaContext.create(spark)`。
-
-=== "Scala"
+ 请将 `sedona-spark-shaded-3.3` 中的 `3.3` 替换为对应的 Spark 主.次版本号,例如
`sedona-spark-shaded-3.4_2.12`。
- ```scala
- import org.apache.sedona.spark.SedonaContext
+你也可以通过给 `spark-submit` 或 `spark-shell` 传入 `--conf
spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` 来注册 Sedona。
- val sedona = SedonaContext.create(config)
- ```
+## 端到端教程
-=== "Java"
+本教程贯穿一份 2 波段 GeoTIFF —— 一小块 AOI 上的红光与近红外反射率影像 —— 让它走完一条典型的栅格处理流水线。场景影像在 Python
中合成,因此整个示例可重复运行,仓库无需新增任何二进制数据。对真实的 Sentinel-2 切片,下面同样的 SQL 不用改一行就能跑,仅输入路径需要换一下。
- ```java
- import org.apache.sedona.spark.SedonaContext;
+
- SparkSession sedona = SedonaContext.create(config)
- ```
+??? example "真实栅格长什么样"
-=== "Python"
+ 上面这套代码可处理任何符合 GeoTIFF 规范的数据。下面两个示例来自 Sedona 自带的测试资源:
- ```python
- from sedona.spark import *
+ | 3 波段彩色栅格 | 单波段栅格 |
+ | :--- | :--- |
+ |  |
 |
- sedona = SedonaContext.create(config)
- ```
+ 它们的 `RS_NumBands(rast)` 分别返回 `3` 和 `1`。`RS_Band(rast,
ARRAY(1,2,3))`、`RS_MapAlgebra` 等波段级函数在两类栅格上用法一致。
-也可以通过 `spark-submit` / `spark-shell` 传入 `--conf
spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` 一并完成注册。
+### 1. 准备输入场景
-## 加载 GeoTiff 数据
+合成一份 256 × 256 的栅格,其中包含一块圆形植被田。真实工作流跳过这一步,直接让 Sedona 读取磁盘或对象存储上已有的 GeoTIFF。
-加载 GeoTiff 栅格数据的推荐方式是 `raster` 数据源。它会加载 GeoTiff 文件并自动将其切分为较小的 tile,每个 tile 在结果
DataFrame 中作为一行,并以 `Raster` 类型存储。
+```python
+import os
+import numpy as np
+import rasterio
+from rasterio.transform import from_bounds
+
+WORK = "/tmp/sedona-raster-tutorial"
+os.makedirs(WORK, exist_ok=True)
+
+AOI = (-91.10, 41.50, -91.00, 41.60) # xmin, ymin, xmax, ymax,EPSG:4326
+W = H = 256
+transform = from_bounds(*AOI, W, H)
+rng = np.random.default_rng(42)
+
+ys, xs = np.mgrid[0:H, 0:W]
+field = ((xs - 96) ** 2 + (ys - 160) ** 2) < 60**2 # 圆形植被田
+
+red = (1500 + 200 * rng.standard_normal((H, W))).clip(0,
10000).astype("uint16")
+nir = (1800 + 200 * rng.standard_normal((H, W))).clip(0, 10000)
+nir = np.where(field, nir + 4000, nir).astype("uint16")
+
+with rasterio.open(
+ f"{WORK}/scene.tif",
+ "w",
+ driver="GTiff",
+ tiled=True,
+ blockxsize=256,
+ blockysize=256,
+ height=H,
+ width=W,
+ count=2,
+ dtype="uint16",
+ crs="EPSG:4326",
+ transform=transform,
+) as dst:
+ dst.write(red, 1)
+ dst.set_band_description(1, "red")
+ dst.write(nir, 2)
+ dst.set_band_description(2, "nir")
+```
+
+### 2. 使用 `raster` 数据源加载
+
+`raster` 数据源可以加载 GeoTIFF 并自动将文件切成多个 tile。每一个 tile 在结果 DataFrame 中对应一行,`Raster`
类型保存于其中的一列。
=== "Scala"
- ```scala
- var rasterDf = sedona.read.format("raster").load("/some/path/*.tif")
- rasterDf.createOrReplaceTempView("rasterDf")
- rasterDf.show()
- ```
+
+ ```scala
+ val rasterDf = sedona.read.format("raster").load(s"$WORK/scene.tif")
+ rasterDf.createOrReplaceTempView("rasterDf")
+ rasterDf.show()
+ ```
=== "Java"
- ```java
- Dataset<Row> rasterDf =
sedona.read().format("raster").load("/some/path/*.tif");
- rasterDf.createOrReplaceTempView("rasterDf");
- rasterDf.show();
- ```
+
+ ```java
+ Dataset<Row> rasterDf = sedona.read().format("raster").load(WORK +
"/scene.tif");
+ rasterDf.createOrReplaceTempView("rasterDf");
+ rasterDf.show();
+ ```
=== "Python"
- ```python
- rasterDf = sedona.read.format("raster").load("/some/path/*.tif")
- rasterDf.createOrReplaceTempView("rasterDf")
- rasterDf.show()
- ```
-输出大致如下:
+ ```python
+ rasterDf = sedona.read.format("raster").load(f"{WORK}/scene.tif")
+ rasterDf.createOrReplaceTempView("rasterDf")
+ rasterDf.show()
+ ```
```
-+--------------------+---+---+----+
-| rast| x| y|name|
-+--------------------+---+---+----+
-|GridCoverage2D["g...| 0| 0| ...|
-|GridCoverage2D["g...| 1| 0| ...|
-|GridCoverage2D["g...| 2| 0| ...|
-...
++--------------------+---+---+----------+
+| rast| x| y| name|
++--------------------+---+---+----------+
+|GridCoverage2D["g...| 0| 0| scene.tif|
++--------------------+---+---+----------+
```
各列含义:
-- `rast`:以 `Raster` 类型表示的栅格数据。
-- `x`:tile 的 X 坐标(从 0 开始);只有未禁用 retile 时才会出现。
-- `y`:tile 的 Y 坐标(从 0 开始);同上。
-- `name`:栅格文件名。
-
-### Tiling 选项
-
-默认情况下启用 tiling(`retile = true`),且 tile 大小由 GeoTiff 的内部 tile 方案决定,无需手动指定
`tileWidth` 或 `tileHeight`。建议栅格数据使用 [Cloud Optimized GeoTIFF
(COG)](https://www.cogeo.org/) 格式,因为它通常会把像素数据组织为正方形 tile。
+- `rast` —— 栅格数据,Sedona 内置 `Raster` 类型。
+- `x`、`y` —— 当前 tile 在源文件内的 0 基索引(仅在启用 retile 时出现)。
+- `name` —— 源文件名。
-也可以可选地覆盖 tile 大小,或完全禁用 tiling:
-
-| 选项 | 默认值 | 说明 |
-| :--- | :--- | :--- |
-| `retile` | `true` | 是否启用 tiling。设为 `false` 则把整张栅格作为一行加载。 |
-| `tileWidth` | GeoTiff 内部 tile 宽度 | 可选。手动指定每个 tile 的宽度(像素)。 |
-| `tileHeight` | 若设置了 `tileWidth` 则与之相同,否则使用 GeoTiff 内部 tile 高度 | 可选。手动指定每个
tile 的高度(像素)。 |
-| `padWithNoData` | `false` | 当右、下边缘的 tile 小于指定大小时,使用 NODATA 值进行填充。 |
-
-覆盖 tile 大小:
-
-=== "Python"
- ```python
- rasterDf = (
- sedona.read.format("raster")
- .option("tileWidth", "256")
- .option("tileHeight", "256")
- .load("/some/path/*.tif")
- )
- ```
+256 × 256 的场景在这里恰好放在一个 tile 中,所以 DataFrame 只有一行。一份几 GB 的 GeoTIFF 则会产生很多行 ——
下游同一份 SQL 在两种情况下都能工作。
-!!!note
- 如果栅格数据的内部 tile 方案不利于切分,`raster` 数据源会抛出错误。可以通过 `option("retile", "false")`
关闭自动切分,或手动指定 tile 大小绕过该问题。更彻底的做法是使用 `gdal_translate` 等工具把数据转换为 COG 格式。
+
-### 从目录中加载栅格文件
+更多内容见下文的 [加载选项](#loading-options):包括 tile 大小覆盖、目录递归读取、以及 NetCDF / Arc Grid 等非
GeoTIFF 格式。
-`raster` 数据源也支持 Spark 通用文件源选项,如 `option("pathGlobFilter", "*.tif*")` 与
`option("recursiveFileLookup", "true")`。例如,递归加载某目录下所有 `.tif` 文件:
+### 3. 检视元数据
-=== "Python"
- ```python
- rasterDf = (
- sedona.read.format("raster")
- .option("recursiveFileLookup", "true")
- .option("pathGlobFilter", "*.tif*")
- .load(path_to_raster_data_folder)
- )
- ```
+在做任何处理之前,先确认像素维度、地理参考与坐标系:
-!!!tip
- 当传入路径以 `/` 结尾时,`raster` 数据源会递归地在该目录及其子目录中查找栅格文件。这等价于不带尾部 `/` 并设置
`option("recursiveFileLookup", "true")`。
+```python
+sedona.sql("""
+ SELECT RS_Width(rast) AS width,
+ RS_Height(rast) AS height,
+ RS_NumBands(rast) AS bands,
+ RS_SRID(rast) AS srid,
+ RS_GeoReference(rast) AS world_file
+ FROM rasterDf
+""").show(truncate=False)
+```
Review Comment:
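Displaying `RS_GeoReference` with `.show(truncate=False)` here renders the
newlines as `\n` (as the sample output shows), which can mislead readers into
thinking the world file is a single-line string. Consider adding a sentence of
explanation, or offering a `collect()`/`print()` alternative (the
RS_GeoReference docs have an example).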
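
A minimal sketch of the `print()` alternative, assuming the world file string
has the value shown in the sample output of this hunk. It uses a plain Python
string so it runs without a Spark session; fetching the value from a collected
row (see the comment in the code) is an assumed usage, not code from the PR.

```python
# World file string copied from the sample output in the hunk.
# With Spark it would come from a collected row, for example (assumed usage):
#   world_file = sedona.sql(
#       "SELECT RS_GeoReference(rast) AS world_file FROM rasterDf"
#   ).first()["world_file"]
world_file = "0.000391\n0.000000\n0.000000\n-0.000391\n-91.099805\n41.599805"

# repr() reproduces the escaped, single-line rendering seen in the table output.
escaped = repr(world_file)[1:-1]

# splitlines() (or a plain print) exposes the six-line world file layout.
world_file_lines = world_file.splitlines()
for line in world_file_lines:
    print(line)
```

Printing the collected value keeps the tutorial's table output unchanged while
making it clear that the world file has six lines.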
##########
docs/tutorial/raster.zh.md:
##########
@@ -90,621 +62,634 @@ SedonaSQL 详细 API 说明请参阅 [SedonaSQL
API](../api/sql/Overview.md)。
import org.apache.sedona.spark.SedonaContext;
SparkSession config = SedonaContext.builder()
- .master("local[*]") // 集群模式下请删除此行
- .appName("readTestScala") // 改成合适的名字
- .getOrCreate()
- ```
- 如果同时使用 SedonaViz 与 SedonaSQL,请在 `SedonaContext.builder()` 之后追加以下行启用
Sedona Kryo 序列化器:
- ```scala
- .config("spark.kryo.registrator",
SedonaVizKryoRegistrator.class.getName) //
org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
+ .master("local[*]")
+ .appName("rasterTutorial")
+ .getOrCreate();
+ SparkSession sedona = SedonaContext.create(config);
```
=== "Python"
```python
- from sedona.spark import *
-
- config = SedonaContext.builder() .\
- config('spark.jars.packages',
- 'org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{
sedona.current_version }},'
- 'org.datasyslab:geotools-wrapper:{{ sedona.current_geotools
}}'). \
- getOrCreate()
+ from sedona.spark import SedonaContext
+
+ config = (
+ SedonaContext.builder()
+ .config(
+ "spark.jars.packages",
+ "org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{
sedona.current_version }},"
+ "org.datasyslab:geotools-wrapper:{{ sedona.current_geotools }}",
+ )
+ .getOrCreate()
+ )
+ sedona = SedonaContext.create(config)
```
- 请将 sedona-spark-shaded 包名中的 `3.3` 替换为对应的 Spark major.minor 版本,例如
`sedona-spark-shaded-3.4_2.12:{{ sedona.current_version }}`。
-
-## 初始化 SedonaContext
-
-在创建 Sedona 配置之后加上以下代码。如果您已经有了由 Wherobots / AWS EMR / Databricks 创建的
SparkSession(通常名为 `spark`),请改为调用 `SedonaContext.create(spark)`。
-
-=== "Scala"
+ 请将 `sedona-spark-shaded-3.3` 中的 `3.3` 替换为对应的 Spark 主.次版本号,例如
`sedona-spark-shaded-3.4_2.12`。
- ```scala
- import org.apache.sedona.spark.SedonaContext
+你也可以通过给 `spark-submit` 或 `spark-shell` 传入 `--conf
spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` 来注册 Sedona。
- val sedona = SedonaContext.create(config)
- ```
+## 端到端教程
-=== "Java"
+本教程贯穿一份 2 波段 GeoTIFF —— 一小块 AOI 上的红光与近红外反射率影像 —— 让它走完一条典型的栅格处理流水线。场景影像在 Python
中合成,因此整个示例可重复运行,仓库无需新增任何二进制数据。对真实的 Sentinel-2 切片,下面同样的 SQL 不用改一行就能跑,仅输入路径需要换一下。
- ```java
- import org.apache.sedona.spark.SedonaContext;
+
- SparkSession sedona = SedonaContext.create(config)
- ```
+??? example "真实栅格长什么样"
-=== "Python"
+ 上面这套代码可处理任何符合 GeoTIFF 规范的数据。下面两个示例来自 Sedona 自带的测试资源:
- ```python
- from sedona.spark import *
+ | 3 波段彩色栅格 | 单波段栅格 |
+ | :--- | :--- |
+ |  |
 |
- sedona = SedonaContext.create(config)
- ```
+ 它们的 `RS_NumBands(rast)` 分别返回 `3` 和 `1`。`RS_Band(rast,
ARRAY(1,2,3))`、`RS_MapAlgebra` 等波段级函数在两类栅格上用法一致。
-也可以通过 `spark-submit` / `spark-shell` 传入 `--conf
spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` 一并完成注册。
+### 1. 准备输入场景
-## 加载 GeoTiff 数据
+合成一份 256 × 256 的栅格,其中包含一块圆形植被田。真实工作流跳过这一步,直接让 Sedona 读取磁盘或对象存储上已有的 GeoTIFF。
-加载 GeoTiff 栅格数据的推荐方式是 `raster` 数据源。它会加载 GeoTiff 文件并自动将其切分为较小的 tile,每个 tile 在结果
DataFrame 中作为一行,并以 `Raster` 类型存储。
+```python
+import os
+import numpy as np
+import rasterio
+from rasterio.transform import from_bounds
+
+WORK = "/tmp/sedona-raster-tutorial"
+os.makedirs(WORK, exist_ok=True)
+
+AOI = (-91.10, 41.50, -91.00, 41.60) # xmin, ymin, xmax, ymax,EPSG:4326
+W = H = 256
+transform = from_bounds(*AOI, W, H)
+rng = np.random.default_rng(42)
+
+ys, xs = np.mgrid[0:H, 0:W]
+field = ((xs - 96) ** 2 + (ys - 160) ** 2) < 60**2 # 圆形植被田
+
+red = (1500 + 200 * rng.standard_normal((H, W))).clip(0,
10000).astype("uint16")
+nir = (1800 + 200 * rng.standard_normal((H, W))).clip(0, 10000)
+nir = np.where(field, nir + 4000, nir).astype("uint16")
+
+with rasterio.open(
+ f"{WORK}/scene.tif",
+ "w",
+ driver="GTiff",
+ tiled=True,
+ blockxsize=256,
+ blockysize=256,
+ height=H,
+ width=W,
+ count=2,
+ dtype="uint16",
+ crs="EPSG:4326",
+ transform=transform,
+) as dst:
+ dst.write(red, 1)
+ dst.set_band_description(1, "red")
+ dst.write(nir, 2)
+ dst.set_band_description(2, "nir")
+```
+
+### 2. 使用 `raster` 数据源加载
+
+`raster` 数据源可以加载 GeoTIFF 并自动将文件切成多个 tile。每一个 tile 在结果 DataFrame 中对应一行,`Raster`
类型保存于其中的一列。
=== "Scala"
- ```scala
- var rasterDf = sedona.read.format("raster").load("/some/path/*.tif")
- rasterDf.createOrReplaceTempView("rasterDf")
- rasterDf.show()
- ```
+
+ ```scala
+ val rasterDf = sedona.read.format("raster").load(s"$WORK/scene.tif")
+ rasterDf.createOrReplaceTempView("rasterDf")
+ rasterDf.show()
+ ```
=== "Java"
- ```java
- Dataset<Row> rasterDf =
sedona.read().format("raster").load("/some/path/*.tif");
- rasterDf.createOrReplaceTempView("rasterDf");
- rasterDf.show();
- ```
+
+ ```java
+ Dataset<Row> rasterDf = sedona.read().format("raster").load(WORK +
"/scene.tif");
+ rasterDf.createOrReplaceTempView("rasterDf");
+ rasterDf.show();
+ ```
=== "Python"
- ```python
- rasterDf = sedona.read.format("raster").load("/some/path/*.tif")
- rasterDf.createOrReplaceTempView("rasterDf")
- rasterDf.show()
- ```
-输出大致如下:
+ ```python
+ rasterDf = sedona.read.format("raster").load(f"{WORK}/scene.tif")
+ rasterDf.createOrReplaceTempView("rasterDf")
+ rasterDf.show()
+ ```
```
-+--------------------+---+---+----+
-| rast| x| y|name|
-+--------------------+---+---+----+
-|GridCoverage2D["g...| 0| 0| ...|
-|GridCoverage2D["g...| 1| 0| ...|
-|GridCoverage2D["g...| 2| 0| ...|
-...
++--------------------+---+---+----------+
+| rast| x| y| name|
++--------------------+---+---+----------+
+|GridCoverage2D["g...| 0| 0| scene.tif|
++--------------------+---+---+----------+
```
各列含义:
-- `rast`:以 `Raster` 类型表示的栅格数据。
-- `x`:tile 的 X 坐标(从 0 开始);只有未禁用 retile 时才会出现。
-- `y`:tile 的 Y 坐标(从 0 开始);同上。
-- `name`:栅格文件名。
-
-### Tiling 选项
-
-默认情况下启用 tiling(`retile = true`),且 tile 大小由 GeoTiff 的内部 tile 方案决定,无需手动指定
`tileWidth` 或 `tileHeight`。建议栅格数据使用 [Cloud Optimized GeoTIFF
(COG)](https://www.cogeo.org/) 格式,因为它通常会把像素数据组织为正方形 tile。
+- `rast` —— 栅格数据,Sedona 内置 `Raster` 类型。
+- `x`、`y` —— 当前 tile 在源文件内的 0 基索引(仅在启用 retile 时出现)。
+- `name` —— 源文件名。
-也可以可选地覆盖 tile 大小,或完全禁用 tiling:
-
-| 选项 | 默认值 | 说明 |
-| :--- | :--- | :--- |
-| `retile` | `true` | 是否启用 tiling。设为 `false` 则把整张栅格作为一行加载。 |
-| `tileWidth` | GeoTiff 内部 tile 宽度 | 可选。手动指定每个 tile 的宽度(像素)。 |
-| `tileHeight` | 若设置了 `tileWidth` 则与之相同,否则使用 GeoTiff 内部 tile 高度 | 可选。手动指定每个
tile 的高度(像素)。 |
-| `padWithNoData` | `false` | 当右、下边缘的 tile 小于指定大小时,使用 NODATA 值进行填充。 |
-
-覆盖 tile 大小:
-
-=== "Python"
- ```python
- rasterDf = (
- sedona.read.format("raster")
- .option("tileWidth", "256")
- .option("tileHeight", "256")
- .load("/some/path/*.tif")
- )
- ```
+256 × 256 的场景在这里恰好放在一个 tile 中,所以 DataFrame 只有一行。一份几 GB 的 GeoTIFF 则会产生很多行 ——
下游同一份 SQL 在两种情况下都能工作。
-!!!note
- 如果栅格数据的内部 tile 方案不利于切分,`raster` 数据源会抛出错误。可以通过 `option("retile", "false")`
关闭自动切分,或手动指定 tile 大小绕过该问题。更彻底的做法是使用 `gdal_translate` 等工具把数据转换为 COG 格式。
+
-### 从目录中加载栅格文件
+更多内容见下文的 [加载选项](#loading-options):包括 tile 大小覆盖、目录递归读取、以及 NetCDF / Arc Grid 等非
GeoTIFF 格式。
-`raster` 数据源也支持 Spark 通用文件源选项,如 `option("pathGlobFilter", "*.tif*")` 与
`option("recursiveFileLookup", "true")`。例如,递归加载某目录下所有 `.tif` 文件:
+### 3. 检视元数据
-=== "Python"
- ```python
- rasterDf = (
- sedona.read.format("raster")
- .option("recursiveFileLookup", "true")
- .option("pathGlobFilter", "*.tif*")
- .load(path_to_raster_data_folder)
- )
- ```
+在做任何处理之前,先确认像素维度、地理参考与坐标系:
-!!!tip
- 当传入路径以 `/` 结尾时,`raster` 数据源会递归地在该目录及其子目录中查找栅格文件。这等价于不带尾部 `/` 并设置
`option("recursiveFileLookup", "true")`。
+```python
+sedona.sql("""
+ SELECT RS_Width(rast) AS width,
+ RS_Height(rast) AS height,
+ RS_NumBands(rast) AS bands,
+ RS_SRID(rast) AS srid,
+ RS_GeoReference(rast) AS world_file
+ FROM rasterDf
+""").show(truncate=False)
+```
-!!!tip
- 加载栅格之后,可以在 Jupyter Notebook 中通过 `SedonaUtils.display_image(df)`
快速预览:它会自动识别栅格列并以图像形式渲染。详情请参阅
[栅格可视化文档](../api/sql/Raster-Functions.md#raster-output)。
+```
++-----+------+-----+----+----------------------------------------------------------+
+|width|height|bands|srid|world_file
|
++-----+------+-----+----+----------------------------------------------------------+
+|256 |256 |2
|4326|0.000391\n0.000000\n0.000000\n-0.000391\n-91.099805\n41.599805|
++-----+------+-----+----+----------------------------------------------------------+
+```
-## 加载非 GeoTiff 数据(NetCDF、Arc Grid)
+[`RS_MetaData`](../api/sql/Raster-Operators/RS_MetaData.md)
把同样的信息以单个数组返回:`[upperLeftX, upperLeftY, width, height, scaleX, scaleY, skewX,
skewY, srid, numBands]`。
-对于 NetCDF、Arc Info ASCII Grid 等非 GeoTiff 栅格格式,请配合 Spark 内置的 `binaryFile` 数据源与
Sedona 的栅格构造器一起使用。
+地理参考字段共同定义了从像素空间到世界坐标的仿射变换:
-### 步骤 1:加载为 binary DataFrame
+
-=== "Scala"
- ```scala
- var rawDf = sedona.read.format("binaryFile").load(path_to_raster_data)
- rawDf.createOrReplaceTempView("rawdf")
- rawDf.show()
- ```
+每个访问器的详细说明见 [栅格元数据参考](#raster-metadata-reference);运行时的像素—世界坐标互转使用
[`RS_PixelAsPoint`](../api/sql/Pixel-Functions/RS_PixelAsPoint.md) 与
[`RS_WorldToRasterCoord`](../api/sql/Raster-Accessors/RS_WorldToRasterCoord.md)。
-=== "Java"
- ```java
- Dataset<Row> rawDf =
sedona.read().format("binaryFile").load(path_to_raster_data);
- rawDf.createOrReplaceTempView("rawdf");
- rawDf.show();
- ```
+### 4. 可视化原始栅格
-=== "Python"
- ```python
- rawDf = sedona.read.format("binaryFile").load(path_to_raster_data)
- rawDf.createOrReplaceTempView("rawdf")
- rawDf.show()
- ```
+先把两个波段渲染出来,看看处理前的输入是什么样子。`SedonaUtils.display_image` 在 Jupyter notebook
中会自动检测栅格列并就地渲染:
-输出大致如下:
+```python
+from sedona.spark import SedonaUtils
-```
-| path| modificationTime|length| content|
-+--------------------+--------------------+------+--------------------+
-|file:/Download/ra...|2023-09-06 16:24:...|174803|[49 49 2A 00 08 0...|
+SedonaUtils.display_image(
+ sedona.sql("SELECT RS_Band(rast, ARRAY(1)) AS rast FROM rasterDf")
+)
+SedonaUtils.display_image(
+ sedona.sql("SELECT RS_Band(rast, ARRAY(2)) AS rast FROM rasterDf")
+)
```
-如需加载多个栅格文件,可递归加载:
+波段 1 是红光通道 —— 大体是没什么特征的裸地。波段 2 (NIR) 在植被田上明显变亮:
-=== "Python"
- ```python
- rawDf = (
- sedona.read.format("binaryFile")
- .option("recursiveFileLookup", "true")
- .option("pathGlobFilter", "*.asc*")
- .load(path_to_raster_data_folder)
- )
- rawDf.createOrReplaceTempView("rawdf")
- rawDf.show()
- ```
+| 波段 1(红光) | 波段 2 (NIR) |
+| :--- | :--- |
+|  |  |
-### 步骤 2:创建 Raster 类型列
+如果不在 notebook 环境,使用 [`RS_AsImage(rast,
width)`](../api/sql/Raster-Output/RS_AsImage.md) 得到 HTML `<img>` 标签,或者用
[`RS_AsBase64`](../api/sql/Raster-Output/RS_AsBase64.md) 得到任意图像查看器都能解码的 Base64
字符串。
-SedonaSQL 中所有栅格运算都要求 Raster 类型对象,可使用以下任一构造器:
+### 5. 处理 —— 用 map algebra 计算 NDVI
-#### 从 GeoTiff 创建
+归一化植被指数(NDVI)可以把活体植被从其他地物中分离出来:
-```sql
-SELECT RS_FromGeoTiff(content) AS rast, modificationTime, length, path FROM
rawdf
```
-
-#### 从 Arc Grid 创建
-
-```sql
-SELECT RS_FromArcInfoAsciiGrid(content) AS rast, modificationTime, length,
path FROM rawdf
+NDVI = (NIR − Red) / (NIR + Red)
```
-#### 从 NetCDF 创建
-
-加载 NetCDF 文件的方法详见
[RS_FromNetCDF](../api/sql/Raster-Constructors/RS_FromNetCDF.md)。
-
-校验栅格列是否创建成功:
+[`RS_MapAlgebra`](../api/sql/Raster-map-algebra.md) 在一个或多个波段上按像素运行一段脚本。输出类型
`'D'`(double)保留 NDVI 的负值范围:
```python
-rasterDf.printSchema()
-```
-
-```
-root
- |-- rast: raster (nullable = true)
- |-- modificationTime: timestamp (nullable = true)
- |-- length: long (nullable = true)
- |-- path: string (nullable = true)
+ndviDf = sedona.sql("""
+ SELECT RS_MapAlgebra(
+ rast, 'D',
+ 'out[0] = (rast[1] - rast[0]) / (rast[1] + rast[0] + 1e-6);'
+ ) AS rast
+ FROM rasterDf
+""")
+ndviDf.createOrReplaceTempView("ndviDf")
```
-## 栅格的元数据
+
-Sedona 提供了获取栅格元数据的函数,以及获取栅格 world file 的函数。
+Map algebra 是最通用的处理原语 —— 裁剪、遮罩、阈值过滤、不同波段或不同栅格之间的算术运算,都可以用同样的
`RS_MapAlgebra(rast, pixelType, script)`(或双栅格版本)写出来。脚本语法见 [Map
algebra](../api/sql/Raster-map-algebra.md);其他备选算子(`RS_Clip`、`RS_Resample`、`RS_SetValues`)见下文的
[栅格处理参考](#raster-processing-reference)。
-### 元数据
+### 6. 可视化处理结果
-该函数返回一个数组,包含栅格的所有必要信息,详见
[RS_MetaData](../api/sql/Raster-Operators/RS_MetaData.md)。
+NDVI 栅格让植被田一目了然:NDVI 高的像素显示为绿色,其余为淡红色。
-```sql
-SELECT RS_MetaData(rast) FROM rasterDf
+```python
+SedonaUtils.display_image(ndviDf)
```
-输出形如:
-
-```
-[-1.3095817809482181E7, 4021262.7487925636, 512.0, 517.0, 72.32861272132695,
-72.32861272132695, 0.0, 0.0, 3857.0, 1.0]
-```
+
-数组中前两个元素是栅格图像左上像素的真实地理坐标(如经纬度),接下来的两个元素是栅格的像素维度。
+### 7. 使用分区统计聚合到矢量区域
-### World File
+逐像素的 NDVI 很少是最终的可交付成果。真正想回答的问题往往是「哪一块*区域*变绿了?」——
哪个农业地块、哪个普查区、哪个流域。[`RS_ZonalStats(raster, zone,
statType)`](../api/sql/Raster-Band-Accessors/RS_ZonalStats.md) 就是栅格 →
矢量聚合的规范函数:落在区域多边形内的每个像素都会贡献到该统计值上。
-[world file](https://en.wikipedia.org/wiki/World_file) 有 GDAL 与 ESRI 两种
georeference。详情参见
[RS_GeoReference](../api/sql/Raster-Accessors/RS_GeoReference.md)。
+真实的地块边界往往是不规则的 —— 形状各异的田地、地块之间留出的道路与地役权间隙、不属于任何区域的留白。下面在 AOI 上手绘 5 个地块:
-```sql
-SELECT RS_GeoReference(rast, "ESRI") FROM rasterDf
-```
+```python
+from pyspark.sql import Row
+
+parcels = sedona.createDataFrame(
+ [
+ Row(
+ parcel_id="Orchard",
+ wkt="POLYGON((-91.085 41.515, -91.045 41.510, -91.030 41.530, "
+ "-91.040 41.560, -91.075 41.572, -91.085 41.555, -91.085 41.515))",
+ ),
+ Row(
+ parcel_id="EastFarm",
+ wkt="POLYGON((-91.025 41.512, -91.005 41.512, -91.005 41.572, "
+ "-91.035 41.572, -91.025 41.535, -91.025 41.512))",
+ ),
+ Row(
+ parcel_id="WestFarm",
+ wkt="POLYGON((-91.095 41.520, -91.087 41.520, -91.080 41.555, "
+ "-91.080 41.572, -91.095 41.572, -91.095 41.520))",
+ ),
+ Row(
+ parcel_id="NorthBlock",
+ wkt="POLYGON((-91.095 41.580, -91.005 41.580, -91.005 41.598, "
+ "-91.095 41.598, -91.095 41.580))",
+ ),
+ Row(
+ parcel_id="SouthStrip",
+ wkt="POLYGON((-91.095 41.502, -91.005 41.502, -91.005 41.508, "
+ "-91.095 41.508, -91.095 41.502))",
+ ),
+ ]
+).selectExpr("parcel_id", "ST_GeomFromText(wkt) AS geom")
+
+parcels.createOrReplaceTempView("parcels")
+
+ranked = sedona.sql("""
+ SELECT p.parcel_id,
+ ROUND(RS_ZonalStats(n.rast, p.geom, 'mean'), 4) AS mean_ndvi
+ FROM parcels p, ndviDf n
+ ORDER BY mean_ndvi DESC
+""")
+ranked.show()
+```
+
+```
++----------+---------+
+| parcel_id|mean_ndvi|
++----------+---------+
+| Orchard| 0.4213|
+| WestFarm| 0.1182|
+|NorthBlock| 0.0925|
+| EastFarm| 0.0907|
+|SouthStrip| 0.0905|
++----------+---------+
+```
+
+不规则形状的 **Orchard** 地块明显胜出 —— 它正好覆盖了植被田。落在地块缝隙里的像素(道路、未登记土地)不属于任何区域,对所有统计值都没有影响。
+
+
-输出如下:
+!!!note
+ 当输入栅格被分成多个 tile 时,`parcels × ndviDf` 的笛卡尔连接会产生每个 `(parcel, tile)` 一行的结果。要在
tile 之间正确聚合,应该按 tile 先算 `sum` 与 `count`,再 `GROUP BY parcel_id` 用 `SUM(sum) /
SUM(count)`
汇总。思路相同,只是多加一层聚合。[`RS_ZonalStatsAll`](../api/sql/Raster-Band-Accessors/RS_ZonalStatsAll.md)
可以在一次调用里返回所有常用统计量。
-```
-72.328613
-0.000000
-0.000000
--72.328613
--13095781.645176
-4021226.584486
-```
+### 8. 写回磁盘
-world file 用于通过建立图像到世界坐标的变换,把真实地理坐标对应到图像像素,从而实现 georeference 与 geolocate。
+写出栅格分两步:先用 `RS_AsXXX` 把 `Raster` 列转成二进制,再把这个二进制 DataFrame 交给 Sedona 的 `raster`
writer。
-## 栅格操作
+
-自 `v1.5.0` 起 Sedona 增加了大量栅格操作能力。下面给出几个示例查询。
+=== "Scala"
-!!!note
- 更多栅格操作请参阅 [SedonaSQL
栅格算子](../api/sql/Raster-Functions.md#raster-operators)。
+ ```scala
+ import org.apache.spark.sql.functions.expr
Review Comment:
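The Scala write example uses `s"$WORK/ndvi_out"`, but `WORK` is only defined
in the Python snippet. Consider defining `WORK` in the Scala snippet as well
(or writing the output path directly); otherwise the example cannot run as-is.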