jiayuasu opened a new pull request, #2846: URL: https://github.com/apache/sedona/pull/2846
## Did you read the Contributor Guide?

- Yes, I have read the [Contributor Rules](https://sedona.apache.org/latest/community/rule/) and [Contributor Development Guide](https://sedona.apache.org/latest/community/develop/)

## Is this PR related to a ticket?

- Yes, and the PR name follows the format `[GH-XXX] my subject`. Closes #2824

## What changes were proposed in this PR?

Add a new Spark DataSourceV2 (`geotiffinfo`) that reads GeoTIFF file metadata without decoding pixel data, similar to [gdalinfo](https://gdal.org/en/stable/programs/gdalinfo.html).

### Usage

```scala
spark.read.format("geotiffinfo").load("/path/to/rasters/")

// COG detection
spark.read.format("geotiffinfo").load("/path/to/*.tif")
  .filter("isTiled AND size(overviews) > 0")
```

### Output schema

Returns one row per file with: `path`, `driver`, `fileSize`, `width`, `height`, `numBands`, `srid`, `crs`, `geoTransform` (struct), `cornerCoordinates` (struct), `bands` (array of structs with dataType, noData, blockSize, colorInterpretation, unit), `overviews` (array of structs with level, width, height), `metadata` (map), `isTiled`, `compression`.

### Key implementation details

- `isTiled`: reads TIFF TileWidth tag (322) from IIO metadata, not RenderedImage tile size
- `colorInterpretation`: derived from TIFF Photometric tag (262): Gray, Red, Green, Blue, Alpha, Palette
- `compression`: reads TIFF tag 259 description for human-readable names (LZW, Deflate, etc.)
- `overviews`: uses `DatasetLayout.getNumInternalOverviews()` for real overview count, not synthetic tile-based levels
- Schema-aware column pruning via `readDataSchema`
- TIFF IIO metadata extracted before `reader.read()` to avoid stream state issues
- Read-only: `newWriteBuilder` throws `UnsupportedOperationException`

### Files

- **New package**: `spark/common/.../io/geotiffinfo/`: 5 Scala files (DataSource, Table, ScanBuilder+Scan, PartitionReaderFactory, PartitionReader)
- **Service registration**: Added to `META-INF/services/org.apache.spark.sql.sources.DataSourceRegister`
- **Documentation**: `docs/tutorial/files/geotiffinfo-sedona-spark.md`
- **mkdocs.yml**: Added navigation entry

## How was this patch tested?

11 tests in `geotiffInfoTest.scala` with exact-match assertions:

- test1.tiff metadata: width=512, height=517, srid=3857, fileSize=174803, band type=UNSIGNED_8BITS, blockSize=256x256, colorInterpretation=Gray, geoTransform values, cornerCoordinates values
- Cross-validation against `format("raster")` + `RS_Width`/`RS_Height`/`RS_NumBands`/`RS_SRID`
- COG test: generates COG on-the-fly via `RS_AsCOG`, verifies `isTiled=true`, 2 overviews, `blockSize=256x256`
- Empty overviews for non-COG test1.tiff (verified 1 IFD only)
- Glob pattern loading (7 `.tiff` files) and recursive directory loading (9 files total)
- LIMIT pushdown and column pruning

## Did this PR include necessary documentation updates?

- Yes, I have updated the documentation.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
