paleolimbot commented on code in PR #749:
URL: https://github.com/apache/sedona-db/pull/749#discussion_r3041261722
##########
rust/sedona-schema/src/raster.rs:
##########
@@ -16,34 +16,33 @@
// under the License.
use arrow_schema::{DataType, Field, FieldRef, Fields};
-/// Schema for storing raster data in Apache Arrow format.
-/// Utilizing nested structs and lists to represent raster metadata and bands.
+/// Schema for storing N-dimensional raster data in Apache Arrow format.
+///
+/// Each raster has a CRS, an affine transform, explicit spatial dimension names
+/// (`x_dim`, `y_dim`), and a list of bands. Each band is an N-D chunk with named
+/// dimensions, a shape, and optional strides for zero-copy slicing.
+///
+/// Legacy 2D rasters are represented as bands with `dim_names=["y","x"]` and
+/// `shape=[height, width]`.
#[derive(Debug, PartialEq, Clone)]
pub struct RasterSchema;
+
impl RasterSchema {
/// Returns the top-level fields for the raster schema structure.
pub fn fields() -> Fields {
Fields::from(vec![
- Field::new(column::METADATA, Self::metadata_type(), false),
- Field::new(column::CRS, Self::crs_type(), true), // Optional: may be inferred from data
+ Field::new(column::CRS, Self::crs_type(), true),
+ Field::new(column::TRANSFORM, Self::transform_type(), false),
+ Field::new(column::X_DIM, DataType::Utf8View, false),
+ Field::new(column::Y_DIM, DataType::Utf8View, false),
Field::new(column::BANDS, Self::bands_type(), true),
])
}
- /// Raster metadata schema
- pub fn metadata_type() -> DataType {
- DataType::Struct(Fields::from(vec![
- // Raster dimensions
- Field::new(column::WIDTH, DataType::UInt64, false),
- Field::new(column::HEIGHT, DataType::UInt64, false),
- // Geospatial transformation parameters
- Field::new(column::UPPERLEFT_X, DataType::Float64, false),
- Field::new(column::UPPERLEFT_Y, DataType::Float64, false),
- Field::new(column::SCALE_X, DataType::Float64, false),
- Field::new(column::SCALE_Y, DataType::Float64, false),
- Field::new(column::SKEW_X, DataType::Float64, false),
- Field::new(column::SKEW_Y, DataType::Float64, false),
- ]))
+ /// Affine transform schema — 6-element GDAL GeoTransform:
+ /// `[origin_x, scale_x, skew_x, origin_y, skew_y, scale_y]`
+ pub fn transform_type() -> DataType {
+ DataType::List(FieldRef::new(Field::new("item", DataType::Float64, false)))
}
Review Comment:
Just highlighting for myself + others that this is the main change.
The main conceptual change here seems to be that the 2D height and width now
live on the bands instead of in top-level metadata. Even though it might
involve some duplication, I wonder if a top-level shape (whose ordering is
always x, y, and possibly z at some point in the future) would still be
appropriate. Each band would then be parsed according to the x/y dim names and
validated against the top-level declaration.
X_DIM + Y_DIM should probably also be a single list to allow for a future Z.
Lists are kind of a pain in Rust, but interacting with them in Python is fairly
easy, and this PR is probably the only one that needs to consider that detail.
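
To make the suggestion concrete, here is a rough sketch of what validating
bands against a top-level shape could look like. Everything in it is an
assumption for illustration: `BandDims`, `dim_extent`, and `validate_bands`
are hypothetical names, not part of this PR, and it works on plain Rust values
rather than Arrow arrays.

```rust
/// Hypothetical in-memory view of one band's dimension metadata
/// (illustrative only; the PR stores these as Arrow list columns).
pub struct BandDims<'a> {
    pub dim_names: Vec<&'a str>,
    pub shape: Vec<u64>,
}

/// Extent of the dimension called `name`, if the band has one.
fn dim_extent(band: &BandDims<'_>, name: &str) -> Option<u64> {
    band.dim_names
        .iter()
        .position(|d| *d == name)
        .and_then(|i| band.shape.get(i).copied())
}

/// Check every band against a top-level spatial declaration.
/// `spatial_dims` and `spatial_shape` are both ordered [x, y] (assumed).
pub fn validate_bands(
    spatial_dims: [&str; 2],
    spatial_shape: [u64; 2],
    bands: &[BandDims<'_>],
) -> Result<(), String> {
    for (i, band) in bands.iter().enumerate() {
        if band.dim_names.len() != band.shape.len() {
            return Err(format!("band {}: dim_names and shape differ in length", i));
        }
        for (name, expected) in spatial_dims.iter().zip(spatial_shape.iter()) {
            match dim_extent(band, name) {
                Some(actual) if actual == *expected => {}
                Some(actual) => {
                    return Err(format!(
                        "band {}: dimension '{}' has extent {}, expected {}",
                        i, name, actual, expected
                    ))
                }
                None => return Err(format!("band {}: missing dimension '{}'", i, name)),
            }
        }
    }
    Ok(())
}
```

With this shape of API, a legacy band (`dim_names=["y","x"]`) and a band with
an extra leading dimension would both pass as long as their x/y extents agree
with the top-level declaration, regardless of where x and y sit in `dim_names`.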
##########
rust/sedona-schema/src/raster.rs:
##########
@@ -55,32 +54,37 @@ impl RasterSchema {
)))
}
- /// Individual band schema
+ /// Individual band schema — flattened N-D band with dimension metadata
pub fn band_type() -> DataType {
DataType::Struct(Fields::from(vec![
- Field::new(column::METADATA, Self::band_metadata_type(), false),
- Field::new(column::DATA, Self::band_data_type(), false),
+ Field::new(column::NAME, DataType::Utf8, true),
+ Field::new(column::DIM_NAMES, Self::dim_names_type(), false),
+ Field::new(column::SHAPE, Self::shape_type(), false),
+ Field::new(column::DATATYPE, DataType::UInt32, false),
+ Field::new(column::NODATA, DataType::Binary, true),
+ Field::new(column::STRIDES, Self::strides_type(), false),
+ Field::new(column::OFFSET, DataType::UInt64, false),
+ Field::new(column::OUTDB_URI, DataType::Utf8, true),
+ Field::new(column::DATA, DataType::BinaryView, false),
Review Comment:
Highlighting changes here to check my understanding...the main change is
that bands now carry their own dimensions, with the addition of dim_names. The
spatial dimensions of all the bands have to match. Here outdb_uri typically
refers to a portion of a raster (e.g., a COG tile) rather than a full array.
This also introduces strides/offset, which is terminology from the Python
buffer protocol ( https://docs.python.org/3/c-api/buffer.html ). This allows
some (but not all) subsets to be propagated without modifying `DATA` or
`OUTDB_URI`. In particular, because OUTDB_URI doesn't have to be modified, we
can delay loading tiles for longer.
I am a tiny bit worried that strides/offset alone won't be enough to avoid
every load we would like to avoid, but we also may not have the full picture
of what the subset specification looks like yet. Maybe this can be
`subset: Utf8` and we can sort out the exact subset specifications we'll allow
and how to resolve them later (e.g., GDAL has its own way to specify this as
well).
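
For anyone less familiar with the buffer-protocol idea, here is a minimal
sketch of how strides/offset can express a rectangular subset without touching
the underlying bytes. It assumes non-negative strides measured in bytes and a
one-byte element type; the PR may define these differently. `StridedView` and
its methods are hypothetical names used only for this illustration.

```rust
use std::convert::TryFrom;

/// Hypothetical strided view over a band's bytes. Field names mirror the
/// schema columns, but the struct itself is illustrative only.
pub struct StridedView<'a> {
    pub data: &'a [u8],
    pub shape: Vec<u64>,
    /// Bytes between consecutive elements along each dimension (assumed).
    pub strides: Vec<u64>,
    /// Byte position of the element at index [0, 0, ...].
    pub offset: u64,
}

impl<'a> StridedView<'a> {
    /// Byte position of the element at `index`, or None if out of bounds.
    pub fn byte_position(&self, index: &[u64]) -> Option<usize> {
        if index.len() != self.shape.len() || self.strides.len() != self.shape.len() {
            return None;
        }
        let mut pos = self.offset;
        for ((i, n), s) in index.iter().zip(&self.shape).zip(&self.strides) {
            if i >= n {
                return None;
            }
            pos = pos.checked_add(i.checked_mul(*s)?)?;
        }
        usize::try_from(pos).ok()
    }

    /// Read a single one-byte element.
    pub fn get_u8(&self, index: &[u64]) -> Option<u8> {
        self.data.get(self.byte_position(index)?).copied()
    }

    /// Restrict dimension `dim` to `start..end`. Only `shape` and `offset`
    /// change; `data` is borrowed as-is, so nothing is copied.
    pub fn slice(&self, dim: usize, start: u64, end: u64) -> Option<StridedView<'a>> {
        let extent = *self.shape.get(dim)?;
        let stride = *self.strides.get(dim)?;
        if start > end || end > extent {
            return None;
        }
        let mut shape = self.shape.clone();
        shape[dim] = end - start;
        Some(StridedView {
            data: self.data,
            shape,
            strides: self.strides.clone(),
            offset: self.offset.checked_add(start.checked_mul(stride)?)?,
        })
    }
}
```

For a row-major 3x4 band of one-byte values (`strides = [4, 1]`), taking rows
1..3 and then columns 2..4 only moves `offset` to 6 and shrinks `shape` to
`[2, 2]`. A subset like "every pixel inside this polygon" has no such
representation, which is the "some (but not all)" caveat above.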