Re: [PR] [GH-2769] Improve raster loader, writer, and viz docs [sedona]

via GitHub Sun, 29 Mar 2026 15:45:43 -0700


Copilot commented on code in PR #2802:
URL: https://github.com/apache/sedona/pull/2802#discussion_r3006881067



##########
docs/tutorial/raster.md:
##########
@@ -142,68 +142,121 @@ Add the following line after creating the Sedona config. 
If you already have a S
 
 You can also register everything by passing `--conf 
spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` to 
`spark-submit` or `spark-shell`.
 
-## Load data from files
+## Load GeoTiff data
 
-Assume we have a single raster data file called rasterData.tiff, [at 
Path](https://github.com/apache/sedona/blob/0eae42576c2588fe278f75cef3b17fee600eac90/spark/common/src/test/resources/raster/raster_with_no_data/test5.tiff).
-
-Use the following code to load the data and create a raw Dataframe.
+The recommended way to load GeoTiff raster data is the `raster` data source. 
It loads GeoTiff files and automatically splits them into smaller tiles. Each 
tile becomes a row in the resulting DataFrame stored in `Raster` format.
 
 === "Scala"
     ```scala
-    var rawDf = sedona.read.format("binaryFile").load(path_to_raster_data)
-    rawDf.createOrReplaceTempView("rawdf")
-    rawDf.show()
+    var rasterDf = sedona.read.format("raster").load("/some/path/*.tif")
+    rasterDf.createOrReplaceTempView("rasterDf")
+    rasterDf.show()
     ```
 
 === "Java"
     ```java
-    Dataset<Row> rawDf = 
sedona.read.format("binaryFile").load(path_to_raster_data)
-    rawDf.createOrReplaceTempView("rawdf")
-    rawDf.show()
+    Dataset<Row> rasterDf = 
sedona.read().format("raster").load("/some/path/*.tif");
+    rasterDf.createOrReplaceTempView("rasterDf");
+    rasterDf.show();
     ```
 
 === "Python"
     ```python
-    rawDf = sedona.read.format("binaryFile").load(path_to_raster_data)
-    rawDf.createOrReplaceTempView("rawdf")
-    rawDf.show()
+    rasterDf = sedona.read.format("raster").load("/some/path/*.tif")
+    rasterDf.createOrReplaceTempView("rasterDf")
+    rasterDf.show()
     ```
 
 The output will look like this:
 
 ```
-|                path|    modificationTime|length|             content|
-+--------------------+--------------------+------+--------------------+
-|file:/Download/ra...|2023-09-06 16:24:...|174803|[49 49 2A 00 08 0...|
++--------------------+---+---+----+
+|                rast|  x|  y|name|
++--------------------+---+---+----+
+|GridCoverage2D["g...|  0|  0| ...|
+|GridCoverage2D["g...|  1|  0| ...|
+|GridCoverage2D["g...|  2|  0| ...|
+...
 ```
 
-For multiple raster data files use the following code to load the data [from 
path](https://github.com/apache/sedona/blob/0eae42576c2588fe278f75cef3b17fee600eac90/spark/common/src/test/resources/raster/)
 and create raw DataFrame.
+The output contains the following columns:
+
+- `rast`: The raster data in `Raster` format.
+- `x`: The 0-based x-coordinate of the tile. This column is only present when 
retile is not disabled.
+- `y`: The 0-based y-coordinate of the tile. This column is only present when 
retile is not disabled.
+- `name`: The name of the raster file.
+
+### Tiling options
+
+By default, tiling is enabled (`retile = true`) and the tile size is 
determined by the GeoTiff file's internal tiling scheme — you do not need to 
specify `tileWidth` or `tileHeight`. It is recommended to use [Cloud Optimized 
GeoTIFF (COG)](https://www.cogeo.org/) format for raster data since they 
usually organize pixel data as square tiles.
+
+You can optionally override the tile size, or disable tiling entirely:
+
+| Option | Default | Description |
+| :--- | :--- | :--- |
+| `retile` | `true` | Whether to enable tiling. Set to `false` to load the 
entire raster as a single row. |
+| `tileWidth` | GeoTiff's internal tile width | Optional. Override the width 
of each tile in pixels. |
+| `tileHeight` | Same as `tileWidth` if set, otherwise GeoTiff's internal tile 
height | Optional. Override the height of each tile in pixels. |
+| `padWithNoData` | `false` | Pad the right and bottom tiles with NODATA 
values if they are smaller than the specified tile size. |
+
+To override the tile size:
+
+=== "Python"
+    ```python
+    rasterDf = (
+        sedona.read.format("raster")
+        .option("tileWidth", "256")
+        .option("tileHeight", "256")
+        .load("/some/path/*.tif")
+    )
+    ```
 
 !!!note
-    The above code works too for loading multiple raster data files. If the 
raster files are in separate directories and the option also makes sure that 
only `.tif` or `.tiff` files are being loaded.
+    If the internal tiling scheme of raster data is not friendly for tiling, 
the `raster` data source will throw an error, and you can disable automatic 
tiling using `option("retile", "false")`, or specify the tile size manually to 
workaround this issue. A better solution is to translate the raster data into 
COG format using `gdal_translate` or other tools.
+
+### Loading raster files from directories
+
+The `raster` data source also works with Spark generic file source options, 
such as `option("pathGlobFilter", "*.tif*")` and `option("recursiveFileLookup", 
"true")`. For instance, you can load all the `.tif` files recursively in a 
directory using:
+
+=== "Python"
+    ```python
+    rasterDf = (
+        sedona.read.format("raster")
+        .option("recursiveFileLookup", "true")
+        .option("pathGlobFilter", "*.tif*")
+        .load(path_to_raster_data_folder)
+    )
+    ```
+
+!!!tip
+    When the loaded path ends with `/`, the `raster` data source will look up 
raster files in the directory and all its subdirectories recursively. This is 
equivalent to specifying a path without trailing `/` and setting 
`option("recursiveFileLookup", "true")`.
+
+!!!tip
+    After loading rasters, you can quickly visualize them in a Jupyter 
notebook using `SedonaUtils.display_image(df)`. It automatically detects raster 
columns and renders them as images. See [Raster visualizer 
docs](../api/sql/Raster-Functions.md#raster-output) for details.
+
+## Load non-GeoTiff data (NetCDF, Arc Grid)
+
+For non-GeoTiff raster formats such as NetCDF or Arc Info ASCII Grid, use the 
Spark built-in `binaryFile` data source together with Sedona raster 
constructors.
+
+### Step 1: Load to a binary DataFrame
 
 === "Scala"
     ```scala
-    var rawDf = sedona.read.format("binaryFile").option("recursiveFileLookup", 
"true").option("pathGlobFilter", "*.tif*").load(path_to_raster_data_folder)
+    var rawDf = sedona.read.format("binaryFile").load(path_to_raster_data)
     rawDf.createOrReplaceTempView("rawdf")
     rawDf.show()
     ```
 
 === "Java"
     ```java
-    Dataset<Row> rawDf = 
sedona.read.format("binaryFile").option("recursiveFileLookup", 
"true").option("pathGlobFilter", "*.tif*").load(path_to_raster_data_folder);
+    Dataset<Row> rawDf = 
sedona.read.format("binaryFile").load(path_to_raster_data);

Review Comment:
   In the Java example for loading non-GeoTiff data, `sedona.read.format(...)` 
is inconsistent with the earlier Java snippet in this same page 
(`sedona.read().format(...)`) and won’t compile with a standard `SparkSession`. 
Use `sedona.read().format("binaryFile")` here for a copy/paste runnable example.
   ```suggestion
       Dataset<Row> rawDf = 
sedona.read().format("binaryFile").load(path_to_raster_data);
   ```



##########
docs/tutorial/raster.md:
##########
@@ -503,47 +565,93 @@ Please refer to [Raster visualizer 
docs](../api/sql/Raster-Functions.md#raster-o
 
 ## Save to permanent storage
 
-Sedona has APIs that can save an entire raster column to files in a specified 
location. Before saving, the raster type column needs to be converted to a 
binary format. Sedona provides several functions to convert a raster column 
into a binary column suitable for file storage. Once in binary format, the 
raster data can then be written to files on disk using the Sedona file storage 
APIs.
-
-```sparksql
-rasterDf.write.format("raster").option("rasterField", 
"raster").option("fileExtension", 
".tiff").mode(SaveMode.Overwrite).save(dirPath)
-```
+Saving raster data is a two-step process: (1) convert the Raster column to 
binary format using an `RS_AsXXX` function, and (2) write the binary DataFrame 
to files using Sedona's `raster` data source writer.
 
-Sedona has a few writer functions that create the binary DataFrame necessary 
for saving the raster images.
+### Step 1: Convert to binary format
 
-### As Arc Grid
+Choose one of the following output format functions:
 
-Use [RS_AsArcGrid](../api/sql/Raster-writer.md#rs_asarcgrid) to get the binary 
Dataframe of the raster in Arc Grid format.
+| Function | Format | Description |
+| :--- | :--- | :--- |
+| [RS_AsGeoTiff](../api/sql/Raster-Output/RS_AsGeoTiff.md) | GeoTiff | 
General-purpose raster format with optional compression |
+| [RS_AsCOG](../api/sql/Raster-Output/RS_AsCOG.md) | Cloud Optimized GeoTiff | 
Ideal for cloud storage with efficient range-read access |
+| [RS_AsArcGrid](../api/sql/Raster-Output/RS_AsArcGrid.md) | Arc Grid | 
ASCII-based format, single band only |
+| [RS_AsPNG](../api/sql/Raster-Output/RS_AsPNG.md) | PNG | Image format, 
unsigned integer pixel types only |
 
 ```sql
-SELECT RS_AsArcGrid(raster)
+SELECT RS_AsGeoTiff(rast) AS raster_binary FROM rasterDf
 ```
 
-### As GeoTiff
+### Step 2: Write to files
 
-Use [RS_AsGeoTiff](../api/sql/Raster-writer.md#rs_asgeotiff) to get the binary 
Dataframe of the raster in GeoTiff format.
+Use Sedona's built-in `raster` data source to write the binary DataFrame:
 
-```sql
-SELECT RS_AsGeoTiff(raster)
-```
+=== "Scala"
+    ```scala
+    rasterDf.withColumn("raster_binary", expr("RS_AsGeoTiff(rast)"))
+      .write.format("raster").mode("overwrite").save("my_raster_file")
+    ```
 
-### As Cloud Optimized GeoTiff
+=== "Python"
+    ```python
+    rasterDf.withColumn("raster_binary", 
expr("RS_AsGeoTiff(rast)")).write.format(
+        "raster"
+    ).mode("overwrite").save("my_raster_file")
+    ```

Review Comment:
   The “Write to files” examples call `expr("RS_AsGeoTiff(...)")` in both the 
Scala and Python snippets, but neither snippet imports or qualifies `expr`. Add 
the appropriate import (Scala: `org.apache.spark.sql.functions.expr`; Python: 
`pyspark.sql.functions.expr`) or rewrite the examples using `selectExpr` to 
keep them runnable as-is.
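
One possible shape for the `selectExpr` variant in Python is sketched below. It is only an illustration: the helper name `write_geotiff` and its parameters are hypothetical, and the `rast` column name is taken from the loader output shown earlier in this page. Note that, unlike `withColumn`, `selectExpr` keeps only the binary column, so this assumes the writer needs nothing else from the DataFrame.

```python
def write_geotiff(raster_df, output_path):
    """Convert the `rast` column to GeoTiff bytes and write it out.

    Hypothetical helper for illustration only. `raster_df` is assumed to be
    a DataFrame loaded with the `raster` data source, so it has a `rast` column.
    """
    (
        # selectExpr takes a SQL expression string, so no `expr` import is needed.
        raster_df.selectExpr("RS_AsGeoTiff(rast) AS raster_binary")
        .write.format("raster")
        .mode("overwrite")
        .save(output_path)
    )
```

With the tutorial's DataFrame this would be called as `write_geotiff(rasterDf, "my_raster_file")`. Keeping the `withColumn` form and adding `from pyspark.sql.functions import expr` would work equally well if the other columns should be preserved.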



##########
docs/tutorial/raster.md:
##########
@@ -142,68 +142,121 @@ Add the following line after creating the Sedona config. 
If you already have a S
 
 You can also register everything by passing `--conf 
spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` to 
`spark-submit` or `spark-shell`.
 
-## Load data from files
+## Load GeoTiff data
 
-Assume we have a single raster data file called rasterData.tiff, [at 
Path](https://github.com/apache/sedona/blob/0eae42576c2588fe278f75cef3b17fee600eac90/spark/common/src/test/resources/raster/raster_with_no_data/test5.tiff).
-
-Use the following code to load the data and create a raw Dataframe.
+The recommended way to load GeoTiff raster data is the `raster` data source. 
It loads GeoTiff files and automatically splits them into smaller tiles. Each 
tile becomes a row in the resulting DataFrame stored in `Raster` format.
 
 === "Scala"
     ```scala
-    var rawDf = sedona.read.format("binaryFile").load(path_to_raster_data)
-    rawDf.createOrReplaceTempView("rawdf")
-    rawDf.show()
+    var rasterDf = sedona.read.format("raster").load("/some/path/*.tif")
+    rasterDf.createOrReplaceTempView("rasterDf")
+    rasterDf.show()
     ```
 
 === "Java"
     ```java
-    Dataset<Row> rawDf = 
sedona.read.format("binaryFile").load(path_to_raster_data)
-    rawDf.createOrReplaceTempView("rawdf")
-    rawDf.show()
+    Dataset<Row> rasterDf = 
sedona.read().format("raster").load("/some/path/*.tif");
+    rasterDf.createOrReplaceTempView("rasterDf");
+    rasterDf.show();
     ```
 
 === "Python"
     ```python
-    rawDf = sedona.read.format("binaryFile").load(path_to_raster_data)
-    rawDf.createOrReplaceTempView("rawdf")
-    rawDf.show()
+    rasterDf = sedona.read.format("raster").load("/some/path/*.tif")
+    rasterDf.createOrReplaceTempView("rasterDf")
+    rasterDf.show()
     ```
 
 The output will look like this:
 
 ```
-|                path|    modificationTime|length|             content|
-+--------------------+--------------------+------+--------------------+
-|file:/Download/ra...|2023-09-06 16:24:...|174803|[49 49 2A 00 08 0...|
++--------------------+---+---+----+
+|                rast|  x|  y|name|
++--------------------+---+---+----+
+|GridCoverage2D["g...|  0|  0| ...|
+|GridCoverage2D["g...|  1|  0| ...|
+|GridCoverage2D["g...|  2|  0| ...|
+...
 ```
 
-For multiple raster data files use the following code to load the data [from 
path](https://github.com/apache/sedona/blob/0eae42576c2588fe278f75cef3b17fee600eac90/spark/common/src/test/resources/raster/)
 and create raw DataFrame.
+The output contains the following columns:
+
+- `rast`: The raster data in `Raster` format.
+- `x`: The 0-based x-coordinate of the tile. This column is only present when 
retile is not disabled.
+- `y`: The 0-based y-coordinate of the tile. This column is only present when 
retile is not disabled.
+- `name`: The name of the raster file.
+
+### Tiling options
+
+By default, tiling is enabled (`retile = true`) and the tile size is 
determined by the GeoTiff file's internal tiling scheme — you do not need to 
specify `tileWidth` or `tileHeight`. It is recommended to use [Cloud Optimized 
GeoTIFF (COG)](https://www.cogeo.org/) format for raster data since they 
usually organize pixel data as square tiles.
+
+You can optionally override the tile size, or disable tiling entirely:
+
+| Option | Default | Description |
+| :--- | :--- | :--- |
+| `retile` | `true` | Whether to enable tiling. Set to `false` to load the 
entire raster as a single row. |
+| `tileWidth` | GeoTiff's internal tile width | Optional. Override the width 
of each tile in pixels. |
+| `tileHeight` | Same as `tileWidth` if set, otherwise GeoTiff's internal tile 
height | Optional. Override the height of each tile in pixels. |
+| `padWithNoData` | `false` | Pad the right and bottom tiles with NODATA 
values if they are smaller than the specified tile size. |
+
+To override the tile size:
+
+=== "Python"
+    ```python
+    rasterDf = (
+        sedona.read.format("raster")
+        .option("tileWidth", "256")
+        .option("tileHeight", "256")
+        .load("/some/path/*.tif")
+    )
+    ```
 
 !!!note
-    The above code works too for loading multiple raster data files. If the 
raster files are in separate directories and the option also makes sure that 
only `.tif` or `.tiff` files are being loaded.
+    If the internal tiling scheme of raster data is not friendly for tiling, 
the `raster` data source will throw an error, and you can disable automatic 
tiling using `option("retile", "false")`, or specify the tile size manually to 
workaround this issue. A better solution is to translate the raster data into 
COG format using `gdal_translate` or other tools.
+
+### Loading raster files from directories
+
+The `raster` data source also works with Spark generic file source options, 
such as `option("pathGlobFilter", "*.tif*")` and `option("recursiveFileLookup", 
"true")`. For instance, you can load all the `.tif` files recursively in a 
directory using:
+
+=== "Python"
+    ```python
+    rasterDf = (
+        sedona.read.format("raster")
+        .option("recursiveFileLookup", "true")
+        .option("pathGlobFilter", "*.tif*")
+        .load(path_to_raster_data_folder)
+    )
+    ```
+
+!!!tip
+    When the loaded path ends with `/`, the `raster` data source will look up 
raster files in the directory and all its subdirectories recursively. This is 
equivalent to specifying a path without trailing `/` and setting 
`option("recursiveFileLookup", "true")`.
+
+!!!tip
+    After loading rasters, you can quickly visualize them in a Jupyter 
notebook using `SedonaUtils.display_image(df)`. It automatically detects raster 
columns and renders them as images. See [Raster visualizer 
docs](../api/sql/Raster-Functions.md#raster-output) for details.

Review Comment:
   This tip references `SedonaUtils.display_image(df)` but the tutorial’s 
DataFrame is named `rasterDf` in the surrounding examples. Using the same 
variable name here would reduce confusion for readers following along.
   ```suggestion
       After loading rasters, you can quickly visualize them in a Jupyter 
notebook using `SedonaUtils.display_image(rasterDf)`. It automatically detects 
raster columns and renders them as images. See [Raster visualizer 
docs](../api/sql/Raster-Functions.md#raster-output) for details.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
