Copilot commented on code in PR #2802:
URL: https://github.com/apache/sedona/pull/2802#discussion_r3006881067
##########
docs/tutorial/raster.md:
##########
@@ -142,68 +142,121 @@ Add the following line after creating the Sedona config. If you already have a S

 You can also register everything by passing `--conf spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` to `spark-submit` or `spark-shell`.

-## Load data from files
+## Load GeoTiff data

-Assume we have a single raster data file called rasterData.tiff, [at Path](https://github.com/apache/sedona/blob/0eae42576c2588fe278f75cef3b17fee600eac90/spark/common/src/test/resources/raster/raster_with_no_data/test5.tiff).
-
-Use the following code to load the data and create a raw Dataframe.
+The recommended way to load GeoTiff raster data is the `raster` data source. It loads GeoTiff files and automatically splits them into smaller tiles. Each tile becomes a row in the resulting DataFrame stored in `Raster` format.

 === "Scala"
     ```scala
-    var rawDf = sedona.read.format("binaryFile").load(path_to_raster_data)
-    rawDf.createOrReplaceTempView("rawdf")
-    rawDf.show()
+    var rasterDf = sedona.read.format("raster").load("/some/path/*.tif")
+    rasterDf.createOrReplaceTempView("rasterDf")
+    rasterDf.show()
     ```

 === "Java"
     ```java
-    Dataset<Row> rawDf = sedona.read.format("binaryFile").load(path_to_raster_data)
-    rawDf.createOrReplaceTempView("rawdf")
-    rawDf.show()
+    Dataset<Row> rasterDf = sedona.read().format("raster").load("/some/path/*.tif");
+    rasterDf.createOrReplaceTempView("rasterDf");
+    rasterDf.show();
     ```

 === "Python"
     ```python
-    rawDf = sedona.read.format("binaryFile").load(path_to_raster_data)
-    rawDf.createOrReplaceTempView("rawdf")
-    rawDf.show()
+    rasterDf = sedona.read.format("raster").load("/some/path/*.tif")
+    rasterDf.createOrReplaceTempView("rasterDf")
+    rasterDf.show()
     ```

 The output will look like this:

 ```
-|                path|    modificationTime|length|             content|
-+--------------------+--------------------+------+--------------------+
-|file:/Download/ra...|2023-09-06 16:24:...|174803|[49 49 2A 00 08 0...|
++--------------------+---+---+----+
+|                rast|  x|  y|name|
++--------------------+---+---+----+
+|GridCoverage2D["g...|  0|  0| ...|
+|GridCoverage2D["g...|  1|  0| ...|
+|GridCoverage2D["g...|  2|  0| ...|
+...
 ```

-For multiple raster data files use the following code to load the data [from path](https://github.com/apache/sedona/blob/0eae42576c2588fe278f75cef3b17fee600eac90/spark/common/src/test/resources/raster/) and create raw DataFrame.
+The output contains the following columns:
+
+- `rast`: The raster data in `Raster` format.
+- `x`: The 0-based x-coordinate of the tile. This column is only present when retile is not disabled.
+- `y`: The 0-based y-coordinate of the tile. This column is only present when retile is not disabled.
+- `name`: The name of the raster file.
+
+### Tiling options
+
+By default, tiling is enabled (`retile = true`) and the tile size is determined by the GeoTiff file's internal tiling scheme — you do not need to specify `tileWidth` or `tileHeight`. It is recommended to use [Cloud Optimized GeoTIFF (COG)](https://www.cogeo.org/) format for raster data since they usually organize pixel data as square tiles.
+
+You can optionally override the tile size, or disable tiling entirely:
+
+| Option | Default | Description |
+| :--- | :--- | :--- |
+| `retile` | `true` | Whether to enable tiling. Set to `false` to load the entire raster as a single row. |
+| `tileWidth` | GeoTiff's internal tile width | Optional. Override the width of each tile in pixels. |
+| `tileHeight` | Same as `tileWidth` if set, otherwise GeoTiff's internal tile height | Optional. Override the height of each tile in pixels. |
+| `padWithNoData` | `false` | Pad the right and bottom tiles with NODATA values if they are smaller than the specified tile size. |
+
+To override the tile size:
+
+=== "Python"
+    ```python
+    rasterDf = (
+        sedona.read.format("raster")
+        .option("tileWidth", "256")
+        .option("tileHeight", "256")
+        .load("/some/path/*.tif")
+    )
+    ```

 !!!note
-    The above code works too for loading multiple raster data files. If the raster files are in separate directories and the option also makes sure that only `.tif` or `.tiff` files are being loaded.
+    If the internal tiling scheme of raster data is not friendly for tiling, the `raster` data source will throw an error, and you can disable automatic tiling using `option("retile", "false")`, or specify the tile size manually to workaround this issue. A better solution is to translate the raster data into COG format using `gdal_translate` or other tools.
+
+### Loading raster files from directories
+
+The `raster` data source also works with Spark generic file source options, such as `option("pathGlobFilter", "*.tif*")` and `option("recursiveFileLookup", "true")`. For instance, you can load all the `.tif` files recursively in a directory using:
+
+=== "Python"
+    ```python
+    rasterDf = (
+        sedona.read.format("raster")
+        .option("recursiveFileLookup", "true")
+        .option("pathGlobFilter", "*.tif*")
+        .load(path_to_raster_data_folder)
+    )
+    ```
+
+!!!tip
+    When the loaded path ends with `/`, the `raster` data source will look up raster files in the directory and all its subdirectories recursively. This is equivalent to specifying a path without trailing `/` and setting `option("recursiveFileLookup", "true")`.
+
+!!!tip
+    After loading rasters, you can quickly visualize them in a Jupyter notebook using `SedonaUtils.display_image(df)`. It automatically detects raster columns and renders them as images. See [Raster visualizer docs](../api/sql/Raster-Functions.md#raster-output) for details.
+
+## Load non-GeoTiff data (NetCDF, Arc Grid)
+
+For non-GeoTiff raster formats such as NetCDF or Arc Info ASCII Grid, use the Spark built-in `binaryFile` data source together with Sedona raster constructors.
+
+### Step 1: Load to a binary DataFrame

 === "Scala"
     ```scala
-    var rawDf = sedona.read.format("binaryFile").option("recursiveFileLookup", "true").option("pathGlobFilter", "*.tif*").load(path_to_raster_data_folder)
+    var rawDf = sedona.read.format("binaryFile").load(path_to_raster_data)
     rawDf.createOrReplaceTempView("rawdf")
     rawDf.show()
     ```

 === "Java"
     ```java
-    Dataset<Row> rawDf = sedona.read.format("binaryFile").option("recursiveFileLookup", "true").option("pathGlobFilter", "*.tif*").load(path_to_raster_data_folder);
+    Dataset<Row> rawDf = sedona.read.format("binaryFile").load(path_to_raster_data);

Review Comment:
In the Java example for loading non-GeoTiff data, `sedona.read.format(...)` is inconsistent with the earlier Java snippet in this same page (`sedona.read().format(...)`) and won’t compile with a standard `SparkSession`. Use `sedona.read().format("binaryFile")` here for a copy/paste runnable example.

```suggestion
    Dataset<Row> rawDf = sedona.read().format("binaryFile").load(path_to_raster_data);
```

##########
docs/tutorial/raster.md:
##########
@@ -503,47 +565,93 @@ Please refer to [Raster visualizer docs](../api/sql/Raster-Functions.md#raster-o

 ## Save to permanent storage

-Sedona has APIs that can save an entire raster column to files in a specified location. Before saving, the raster type column needs to be converted to a binary format. Sedona provides several functions to convert a raster column into a binary column suitable for file storage. Once in binary format, the raster data can then be written to files on disk using the Sedona file storage APIs.
-
-```sparksql
-rasterDf.write.format("raster").option("rasterField", "raster").option("fileExtension", ".tiff").mode(SaveMode.Overwrite).save(dirPath)
-```
+Saving raster data is a two-step process: (1) convert the Raster column to binary format using an `RS_AsXXX` function, and (2) write the binary DataFrame to files using Sedona's `raster` data source writer.

-Sedona has a few writer functions that create the binary DataFrame necessary for saving the raster images.
+### Step 1: Convert to binary format

-### As Arc Grid
+Choose one of the following output format functions:

-Use [RS_AsArcGrid](../api/sql/Raster-writer.md#rs_asarcgrid) to get the binary Dataframe of the raster in Arc Grid format.
+| Function | Format | Description |
+| :--- | :--- | :--- |
+| [RS_AsGeoTiff](../api/sql/Raster-Output/RS_AsGeoTiff.md) | GeoTiff | General-purpose raster format with optional compression |
+| [RS_AsCOG](../api/sql/Raster-Output/RS_AsCOG.md) | Cloud Optimized GeoTiff | Ideal for cloud storage with efficient range-read access |
+| [RS_AsArcGrid](../api/sql/Raster-Output/RS_AsArcGrid.md) | Arc Grid | ASCII-based format, single band only |
+| [RS_AsPNG](../api/sql/Raster-Output/RS_AsPNG.md) | PNG | Image format, unsigned integer pixel types only |

 ```sql
-SELECT RS_AsArcGrid(raster)
+SELECT RS_AsGeoTiff(rast) AS raster_binary FROM rasterDf
 ```

-### As GeoTiff
+### Step 2: Write to files

-Use [RS_AsGeoTiff](../api/sql/Raster-writer.md#rs_asgeotiff) to get the binary Dataframe of the raster in GeoTiff format.
+Use Sedona's built-in `raster` data source to write the binary DataFrame:

-```sql
-SELECT RS_AsGeoTiff(raster)
-```
+=== "Scala"
+    ```scala
+    rasterDf.withColumn("raster_binary", expr("RS_AsGeoTiff(rast)"))
+      .write.format("raster").mode("overwrite").save("my_raster_file")
+    ```

-### As Cloud Optimized GeoTiff
+=== "Python"
+    ```python
+    rasterDf.withColumn("raster_binary", expr("RS_AsGeoTiff(rast)")).write.format(
+        "raster"
+    ).mode("overwrite").save("my_raster_file")
+    ```

Review Comment:
The “Write to files” examples use `expr("RS_AsGeoTiff(...)")` in both Scala and Python snippets, but neither snippet imports/qualifies `expr`. Add the appropriate import (Scala: `org.apache.spark.sql.functions.expr`; Python: `pyspark.sql.functions.expr`) or rewrite the examples using `selectExpr` to keep them runnable as-is.

##########
docs/tutorial/raster.md:
##########
@@ -142,68 +142,121 @@ Add the following line after creating the Sedona config. If you already have a S

 You can also register everything by passing `--conf spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` to `spark-submit` or `spark-shell`.

-## Load data from files
+## Load GeoTiff data

-Assume we have a single raster data file called rasterData.tiff, [at Path](https://github.com/apache/sedona/blob/0eae42576c2588fe278f75cef3b17fee600eac90/spark/common/src/test/resources/raster/raster_with_no_data/test5.tiff).
-
-Use the following code to load the data and create a raw Dataframe.
+The recommended way to load GeoTiff raster data is the `raster` data source. It loads GeoTiff files and automatically splits them into smaller tiles. Each tile becomes a row in the resulting DataFrame stored in `Raster` format.

 === "Scala"
     ```scala
-    var rawDf = sedona.read.format("binaryFile").load(path_to_raster_data)
-    rawDf.createOrReplaceTempView("rawdf")
-    rawDf.show()
+    var rasterDf = sedona.read.format("raster").load("/some/path/*.tif")
+    rasterDf.createOrReplaceTempView("rasterDf")
+    rasterDf.show()
     ```

 === "Java"
     ```java
-    Dataset<Row> rawDf = sedona.read.format("binaryFile").load(path_to_raster_data)
-    rawDf.createOrReplaceTempView("rawdf")
-    rawDf.show()
+    Dataset<Row> rasterDf = sedona.read().format("raster").load("/some/path/*.tif");
+    rasterDf.createOrReplaceTempView("rasterDf");
+    rasterDf.show();
     ```

 === "Python"
     ```python
-    rawDf = sedona.read.format("binaryFile").load(path_to_raster_data)
-    rawDf.createOrReplaceTempView("rawdf")
-    rawDf.show()
+    rasterDf = sedona.read.format("raster").load("/some/path/*.tif")
+    rasterDf.createOrReplaceTempView("rasterDf")
+    rasterDf.show()
     ```

 The output will look like this:

 ```
-|                path|    modificationTime|length|             content|
-+--------------------+--------------------+------+--------------------+
-|file:/Download/ra...|2023-09-06 16:24:...|174803|[49 49 2A 00 08 0...|
++--------------------+---+---+----+
+|                rast|  x|  y|name|
++--------------------+---+---+----+
+|GridCoverage2D["g...|  0|  0| ...|
+|GridCoverage2D["g...|  1|  0| ...|
+|GridCoverage2D["g...|  2|  0| ...|
+...
 ```

-For multiple raster data files use the following code to load the data [from path](https://github.com/apache/sedona/blob/0eae42576c2588fe278f75cef3b17fee600eac90/spark/common/src/test/resources/raster/) and create raw DataFrame.
+The output contains the following columns:
+
+- `rast`: The raster data in `Raster` format.
+- `x`: The 0-based x-coordinate of the tile. This column is only present when retile is not disabled.
+- `y`: The 0-based y-coordinate of the tile. This column is only present when retile is not disabled.
+- `name`: The name of the raster file.
+
+### Tiling options
+
+By default, tiling is enabled (`retile = true`) and the tile size is determined by the GeoTiff file's internal tiling scheme — you do not need to specify `tileWidth` or `tileHeight`. It is recommended to use [Cloud Optimized GeoTIFF (COG)](https://www.cogeo.org/) format for raster data since they usually organize pixel data as square tiles.
+
+You can optionally override the tile size, or disable tiling entirely:
+
+| Option | Default | Description |
+| :--- | :--- | :--- |
+| `retile` | `true` | Whether to enable tiling. Set to `false` to load the entire raster as a single row. |
+| `tileWidth` | GeoTiff's internal tile width | Optional. Override the width of each tile in pixels. |
+| `tileHeight` | Same as `tileWidth` if set, otherwise GeoTiff's internal tile height | Optional. Override the height of each tile in pixels. |
+| `padWithNoData` | `false` | Pad the right and bottom tiles with NODATA values if they are smaller than the specified tile size. |
+
+To override the tile size:
+
+=== "Python"
+    ```python
+    rasterDf = (
+        sedona.read.format("raster")
+        .option("tileWidth", "256")
+        .option("tileHeight", "256")
+        .load("/some/path/*.tif")
+    )
+    ```

 !!!note
-    The above code works too for loading multiple raster data files. If the raster files are in separate directories and the option also makes sure that only `.tif` or `.tiff` files are being loaded.
+    If the internal tiling scheme of raster data is not friendly for tiling, the `raster` data source will throw an error, and you can disable automatic tiling using `option("retile", "false")`, or specify the tile size manually to workaround this issue. A better solution is to translate the raster data into COG format using `gdal_translate` or other tools.
+
+### Loading raster files from directories
+
+The `raster` data source also works with Spark generic file source options, such as `option("pathGlobFilter", "*.tif*")` and `option("recursiveFileLookup", "true")`. For instance, you can load all the `.tif` files recursively in a directory using:
+
+=== "Python"
+    ```python
+    rasterDf = (
+        sedona.read.format("raster")
+        .option("recursiveFileLookup", "true")
+        .option("pathGlobFilter", "*.tif*")
+        .load(path_to_raster_data_folder)
+    )
+    ```
+
+!!!tip
+    When the loaded path ends with `/`, the `raster` data source will look up raster files in the directory and all its subdirectories recursively. This is equivalent to specifying a path without trailing `/` and setting `option("recursiveFileLookup", "true")`.
+
+!!!tip
+    After loading rasters, you can quickly visualize them in a Jupyter notebook using `SedonaUtils.display_image(df)`. It automatically detects raster columns and renders them as images. See [Raster visualizer docs](../api/sql/Raster-Functions.md#raster-output) for details.

Review Comment:
This tip references `SedonaUtils.display_image(df)` but the tutorial’s DataFrame is named `rasterDf` in the surrounding examples. Using the same variable name here would reduce confusion for readers following along.

```suggestion
    After loading rasters, you can quickly visualize them in a Jupyter notebook using `SedonaUtils.display_image(rasterDf)`. It automatically detects raster columns and renders them as images. See [Raster visualizer docs](../api/sql/Raster-Functions.md#raster-output) for details.
```
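
For the comment on the “Write to files” examples, one possible shape of the `selectExpr` variant is sketched below. It is only an illustration, not the PR's final text: the helper name `write_geotiff` is hypothetical, and the column names `rast` and `raster_binary`, the `raster` writer format, and the `overwrite` mode are taken from the quoted snippets. It assumes `raster_df` was loaded through a Sedona-enabled session so that `RS_AsGeoTiff` is registered.

```python
def write_geotiff(raster_df, out_path):
    """Hypothetical helper: add a GeoTiff binary column and write it out.

    Assumes `raster_df` has a `rast` column of Raster type, as in the
    tutorial's loading examples.
    """
    (
        raster_df
        # selectExpr parses the SQL strings itself, so no `expr` import is needed.
        # "*" keeps the existing columns, matching what withColumn would do.
        .selectExpr("*", "RS_AsGeoTiff(rast) AS raster_binary")
        .write.format("raster")
        .mode("overwrite")
        .save(out_path)
    )
```

Keeping `withColumn` works as well, provided the snippets add the imports named in the comment (`from pyspark.sql.functions import expr` in Python, `import org.apache.spark.sql.functions.expr` in Scala).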
