jiayuasu opened a new pull request, #2846:
URL: https://github.com/apache/sedona/pull/2846

   ## Did you read the Contributor Guide?
   
   - Yes, I have read the [Contributor 
Rules](https://sedona.apache.org/latest/community/rule/) and [Contributor 
Development Guide](https://sedona.apache.org/latest/community/develop/)
   
   ## Is this PR related to a ticket?
   
   - Yes, and the PR name follows the format `[GH-XXX] my subject`. Closes #2824
   
   ## What changes were proposed in this PR?
   
   Add a new Spark DataSourceV2 (`geotiffinfo`) that reads GeoTIFF file 
metadata without decoding pixel data, similar to 
[gdalinfo](https://gdal.org/en/stable/programs/gdalinfo.html).
   
   ### Usage
   
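   ```scala
   // One row of metadata per GeoTIFF found under the directory
   spark.read.format("geotiffinfo").load("/path/to/rasters/")
   
   // COG detection: keep tiled files that have at least one internal overview
   spark.read.format("geotiffinfo").load("/path/to/*.tif")
     .filter("isTiled AND size(overviews) > 0")
   ```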
   
   ### Output schema
   
   Returns one row per file with: `path`, `driver`, `fileSize`, `width`, 
`height`, `numBands`, `srid`, `crs`, `geoTransform` (struct), 
`cornerCoordinates` (struct), `bands` (array of structs with dataType, noData, 
blockSize, colorInterpretation, unit), `overviews` (array of structs with 
level, width, height), `metadata` (map), `isTiled`, `compression`.
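   
   A minimal sketch of querying the nested columns, assuming the field names 
   listed above; the exact Spark types of the struct fields are not stated 
   here, and the object and helper names (`GeoTiffInfoQueries`, `bandSummary`, 
   `withOverviews`) are hypothetical:
   
   ```scala
   import org.apache.spark.sql.DataFrame
   import org.apache.spark.sql.functions.{col, explode, size}
   
   object GeoTiffInfoQueries {
     // Flatten the `bands` array into one row per (file, band).
     // Field names follow the schema list above; their types are assumed.
     def bandSummary(info: DataFrame): DataFrame =
       info
         .select(col("path"), col("numBands"), explode(col("bands")).as("band"))
         .select(
           col("path"),
           col("numBands"),
           col("band.dataType").as("dataType"),
           col("band.colorInterpretation").as("colorInterpretation"))
   
     // Keep only files that report at least one internal overview.
     def withOverviews(info: DataFrame): DataFrame =
       info
         .filter(size(col("overviews")) > 0)
         .select("path", "width", "height", "overviews")
   }
   ```
   
   For example, 
   `GeoTiffInfoQueries.bandSummary(spark.read.format("geotiffinfo").load("/path/to/rasters/"))` 
   would list the data type and color interpretation of every band.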
   
   ### Key implementation details
   
   - `isTiled`: reads TIFF TileWidth tag (322) from IIO metadata, not 
RenderedImage tile size
   - `colorInterpretation`: derived from TIFF Photometric tag (262) — Gray, 
Red, Green, Blue, Alpha, Palette
   - `compression`: reads TIFF tag 259 description for human-readable names 
(LZW, Deflate, etc.)
   - `overviews`: uses `DatasetLayout.getNumInternalOverviews()` for real 
overview count, not synthetic tile-based levels
   - Schema-aware column pruning via `readDataSchema`
   - TIFF IIO metadata extracted before `reader.read()` to avoid stream state 
issues
   - Read-only: `newWriteBuilder` throws `UnsupportedOperationException`
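   
   To illustrate what the tag-based fields encode, here is a small standalone 
   sketch. It is not the code in this PR, which reads these values from the IIO 
   metadata. The numeric codes are the standard TIFF values for tags 259 and 
   262; the object name `TiffTags`, the `"Undefined"` fallback, and the 
   band-to-channel assignment for RGB are assumptions made for illustration:
   
   ```scala
   object TiffTags {
     // TIFF Compression (tag 259) codes mapped to readable names.
     val compressionNames: Map[Int, String] = Map(
       1 -> "None",
       5 -> "LZW",
       7 -> "JPEG",
       8 -> "Deflate",
       32773 -> "PackBits",
       32946 -> "Deflate")
   
     def compressionName(code: Int): String =
       compressionNames.getOrElse(code, s"Unknown($code)")
   
     // PhotometricInterpretation (tag 262): 0/1 = grayscale, 2 = RGB, 3 = palette.
     // Assigning Red/Green/Blue/Alpha by band index is an assumption of this sketch.
     def colorInterpretation(photometric: Int, bandIndex: Int): String =
       photometric match {
         case 0 | 1 => "Gray"
         case 2 =>
           bandIndex match {
             case 0 => "Red"
             case 1 => "Green"
             case 2 => "Blue"
             case 3 => "Alpha"
             case _ => "Undefined"
           }
         case 3 => "Palette"
         case _ => "Undefined"
       }
   
     // A file counts as tiled when the TileWidth tag (322) is present,
     // regardless of how an image reader chooses to tile its in-memory view.
     def isTiled(tagsPresent: Set[Int]): Boolean = tagsPresent.contains(322)
   }
   ```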
   
   ### Files
   
   - **New package**: `spark/common/.../io/geotiffinfo/` — 5 Scala files 
(DataSource, Table, ScanBuilder+Scan, PartitionReaderFactory, PartitionReader)
   - **Service registration**: Added to 
`META-INF/services/org.apache.spark.sql.sources.DataSourceRegister`
   - **Documentation**: `docs/tutorial/files/geotiffinfo-sedona-spark.md`
   - **mkdocs.yml**: Added navigation entry
   
   ## How was this patch tested?
   
   11 tests in `geotiffInfoTest.scala` with exact-match assertions:
   
   - test1.tiff metadata: width=512, height=517, srid=3857, fileSize=174803, 
band type=UNSIGNED_8BITS, blockSize=256x256, colorInterpretation=Gray, 
geoTransform values, cornerCoordinates values
   - Cross-validation against `format("raster")` + 
`RS_Width`/`RS_Height`/`RS_NumBands`/`RS_SRID`
   - COG test: generates COG on-the-fly via `RS_AsCOG`, verifies 
`isTiled=true`, 2 overviews, `blockSize=256x256`
   - Empty overviews for non-COG test1.tiff (verified 1 IFD only)
   - Glob pattern loading (7 `.tiff` files) and recursive directory loading (9 
files total)
   - LIMIT pushdown and column pruning
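   
   For context on the geoTransform and cornerCoordinates assertions, the 
   following sketch shows how corner coordinates follow from a GDAL-style 
   affine geotransform. The case class, its field names, and the sample 
   transform values are hypothetical and are not the struct layout or test 
   data used in this PR; only the 512x517 image size comes from the test 
   above:
   
   ```scala
   object CornerSketch {
     // GDAL-style affine geotransform; field names here are hypothetical.
     final case class GeoTransform(
         upperLeftX: Double, scaleX: Double, skewX: Double,
         upperLeftY: Double, skewY: Double, scaleY: Double) {
   
       // Map a pixel position (col, row) to world coordinates.
       def toWorld(col: Double, row: Double): (Double, Double) =
         (upperLeftX + col * scaleX + row * skewX,
          upperLeftY + col * skewY + row * scaleY)
     }
   
     // Bounding box (minX, minY, maxX, maxY) of the four image corners.
     def corners(gt: GeoTransform, width: Int, height: Int): (Double, Double, Double, Double) = {
       val pts = Seq(
         gt.toWorld(0, 0),
         gt.toWorld(width, 0),
         gt.toWorld(0, height),
         gt.toWorld(width, height))
       val xs = pts.map(_._1)
       val ys = pts.map(_._2)
       (xs.min, ys.min, xs.max, ys.max)
     }
   }
   
   // Made-up transform: origin (0, 1000), 2-unit pixels, north-up.
   val gt = CornerSketch.GeoTransform(0.0, 2.0, 0.0, 1000.0, 0.0, -2.0)
   val box = CornerSketch.corners(gt, 512, 517)
   // box == (0.0, -34.0, 1024.0, 1000.0)
   ```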
   
   ## Did this PR include necessary documentation updates?
   
   - Yes, I have updated the documentation.

