jiayuasu commented on code in PR #2802:
URL: https://github.com/apache/sedona/pull/2802#discussion_r3006895209
##########
docs/tutorial/raster.md:
##########
@@ -142,68 +142,121 @@ Add the following line after creating the Sedona config. If you already have a S
 You can also register everything by passing `--conf spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` to `spark-submit` or `spark-shell`.

-## Load data from files
+## Load GeoTiff data

-Assume we have a single raster data file called rasterData.tiff, [at Path](https://github.com/apache/sedona/blob/0eae42576c2588fe278f75cef3b17fee600eac90/spark/common/src/test/resources/raster/raster_with_no_data/test5.tiff).
-
-Use the following code to load the data and create a raw Dataframe.
+The recommended way to load GeoTiff raster data is the `raster` data source. It loads GeoTiff files and automatically splits them into smaller tiles. Each tile becomes a row in the resulting DataFrame stored in `Raster` format.

 === "Scala"
     ```scala
-    var rawDf = sedona.read.format("binaryFile").load(path_to_raster_data)
-    rawDf.createOrReplaceTempView("rawdf")
-    rawDf.show()
+    var rasterDf = sedona.read.format("raster").load("/some/path/*.tif")
+    rasterDf.createOrReplaceTempView("rasterDf")
+    rasterDf.show()
     ```

 === "Java"
     ```java
-    Dataset<Row> rawDf = sedona.read.format("binaryFile").load(path_to_raster_data)
-    rawDf.createOrReplaceTempView("rawdf")
-    rawDf.show()
+    Dataset<Row> rasterDf = sedona.read().format("raster").load("/some/path/*.tif");
+    rasterDf.createOrReplaceTempView("rasterDf");
+    rasterDf.show();
     ```

 === "Python"
     ```python
-    rawDf = sedona.read.format("binaryFile").load(path_to_raster_data)
-    rawDf.createOrReplaceTempView("rawdf")
-    rawDf.show()
+    rasterDf = sedona.read.format("raster").load("/some/path/*.tif")
+    rasterDf.createOrReplaceTempView("rasterDf")
+    rasterDf.show()
     ```

 The output will look like this:

 ```
-|                path|    modificationTime|length|             content|
-+--------------------+--------------------+------+--------------------+
-|file:/Download/ra...|2023-09-06 16:24:...|174803|[49 49 2A 00 08 0...|
++--------------------+---+---+----+
+|                rast|  x|  y|name|
++--------------------+---+---+----+
+|GridCoverage2D["g...|  0|  0| ...|
+|GridCoverage2D["g...|  1|  0| ...|
+|GridCoverage2D["g...|  2|  0| ...|
+...
 ```

-For multiple raster data files use the following code to load the data [from path](https://github.com/apache/sedona/blob/0eae42576c2588fe278f75cef3b17fee600eac90/spark/common/src/test/resources/raster/) and create raw DataFrame.
+The output contains the following columns:
+
+- `rast`: The raster data in `Raster` format.
+- `x`: The 0-based x-coordinate of the tile. This column is only present when retile is not disabled.
+- `y`: The 0-based y-coordinate of the tile. This column is only present when retile is not disabled.
+- `name`: The name of the raster file.
+
+### Tiling options
+
+By default, tiling is enabled (`retile = true`) and the tile size is determined by the GeoTiff file's internal tiling scheme — you do not need to specify `tileWidth` or `tileHeight`. It is recommended to use [Cloud Optimized GeoTIFF (COG)](https://www.cogeo.org/) format for raster data since they usually organize pixel data as square tiles.
+
+You can optionally override the tile size, or disable tiling entirely:
+
+| Option | Default | Description |
+| :--- | :--- | :--- |
+| `retile` | `true` | Whether to enable tiling. Set to `false` to load the entire raster as a single row. |
+| `tileWidth` | GeoTiff's internal tile width | Optional. Override the width of each tile in pixels. |
+| `tileHeight` | Same as `tileWidth` if set, otherwise GeoTiff's internal tile height | Optional. Override the height of each tile in pixels. |
+| `padWithNoData` | `false` | Pad the right and bottom tiles with NODATA values if they are smaller than the specified tile size. |
+
+To override the tile size:
+
+=== "Python"
+    ```python
+    rasterDf = (
+        sedona.read.format("raster")
+        .option("tileWidth", "256")
+        .option("tileHeight", "256")
+        .load("/some/path/*.tif")
+    )
+    ```

 !!!note
-    The above code works too for loading multiple raster data files. If the raster files are in separate directories and the option also makes sure that only `.tif` or `.tiff` files are being loaded.
+    If the internal tiling scheme of raster data is not friendly for tiling, the `raster` data source will throw an error, and you can disable automatic tiling using `option("retile", "false")`, or specify the tile size manually to workaround this issue. A better solution is to translate the raster data into COG format using `gdal_translate` or other tools.
+
+### Loading raster files from directories
+
+The `raster` data source also works with Spark generic file source options, such as `option("pathGlobFilter", "*.tif*")` and `option("recursiveFileLookup", "true")`. For instance, you can load all the `.tif` files recursively in a directory using:
+
+=== "Python"
+    ```python
+    rasterDf = (
+        sedona.read.format("raster")
+        .option("recursiveFileLookup", "true")
+        .option("pathGlobFilter", "*.tif*")
+        .load(path_to_raster_data_folder)
+    )
+    ```
+
+!!!tip
+    When the loaded path ends with `/`, the `raster` data source will look up raster files in the directory and all its subdirectories recursively. This is equivalent to specifying a path without trailing `/` and setting `option("recursiveFileLookup", "true")`.
+
+!!!tip
+    After loading rasters, you can quickly visualize them in a Jupyter notebook using `SedonaUtils.display_image(df)`. It automatically detects raster columns and renders them as images. See [Raster visualizer docs](../api/sql/Raster-Functions.md#raster-output) for details.
+
+## Load non-GeoTiff data (NetCDF, Arc Grid)
+
+For non-GeoTiff raster formats such as NetCDF or Arc Info ASCII Grid, use the Spark built-in `binaryFile` data source together with Sedona raster constructors.
+
+### Step 1: Load to a binary DataFrame

 === "Scala"
     ```scala
-    var rawDf = sedona.read.format("binaryFile").option("recursiveFileLookup", "true").option("pathGlobFilter", "*.tif*").load(path_to_raster_data_folder)
+    var rawDf = sedona.read.format("binaryFile").load(path_to_raster_data)
     rawDf.createOrReplaceTempView("rawdf")
     rawDf.show()
     ```

 === "Java"
     ```java
-    Dataset<Row> rawDf = sedona.read.format("binaryFile").option("recursiveFileLookup", "true").option("pathGlobFilter", "*.tif*").load(path_to_raster_data_folder);
+    Dataset<Row> rawDf = sedona.read.format("binaryFile").load(path_to_raster_data);

Review Comment:
Fixed in ccd83eb. Changed to `sedona.read().format("binaryFile")`.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
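
Side note on the tiling options documented in this hunk: the sketch below shows how the 0-based `x` and `y` tile coordinates could relate to a raster's pixel size and the `tileWidth`, `tileHeight` and `padWithNoData` options. It is plain Python, not Sedona code. The function `tile_grid`, its parameters and the 600 x 500 raster are hypothetical, and it assumes that right and bottom edge tiles are simply smaller unless padding is requested, which is how the option description reads but is not confirmed here.

```python
from math import ceil


def tile_grid(width, height, tile_width, tile_height=None, pad_with_nodata=False):
    """Enumerate tiles of a width x height raster (illustration only, not Sedona's implementation)."""
    if tile_height is None:
        # Per the options table: tileHeight falls back to tileWidth when only that is set.
        tile_height = tile_width
    tiles = []
    for y in range(ceil(height / tile_height)):
        for x in range(ceil(width / tile_width)):
            # Assumption: edge tiles cover only the remaining pixels.
            w = min(tile_width, width - x * tile_width)
            h = min(tile_height, height - y * tile_height)
            if pad_with_nodata:
                # With padding, every tile is reported at the full tile size.
                w, h = tile_width, tile_height
            tiles.append({"x": x, "y": y, "width": w, "height": h})
    return tiles


# Hypothetical 600 x 500 pixel raster split into 256-pixel tiles.
tiles = tile_grid(600, 500, 256)
print(len(tiles))
print(tiles[-1])
```

Under these assumptions the raster yields a 3 by 2 grid, and the bottom-right tile is narrower and shorter than 256 pixels unless `pad_with_nodata=True` is passed.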
