james-willis commented on code in PR #749:
URL: https://github.com/apache/sedona-db/pull/749#discussion_r3114302286
##########
rust/sedona-schema/src/raster.rs:
##########
@@ -55,32 +54,37 @@ impl RasterSchema {
)))
}
- /// Individual band schema
+ /// Individual band schema — flattened N-D band with dimension metadata
pub fn band_type() -> DataType {
DataType::Struct(Fields::from(vec![
- Field::new(column::METADATA, Self::band_metadata_type(), false),
- Field::new(column::DATA, Self::band_data_type(), false),
+ Field::new(column::NAME, DataType::Utf8, true),
+ Field::new(column::DIM_NAMES, Self::dim_names_type(), false),
+ Field::new(column::SHAPE, Self::shape_type(), false),
+ Field::new(column::DATATYPE, DataType::UInt32, false),
+ Field::new(column::NODATA, DataType::Binary, true),
+ Field::new(column::STRIDES, Self::strides_type(), false),
+ Field::new(column::OFFSET, DataType::UInt64, false),
+ Field::new(column::OUTDB_URI, DataType::Utf8, true),
+ Field::new(column::DATA, DataType::BinaryView, false),
Review Comment:
I intend to solve this issue by having has-a relationships between BandRef
implementations. Imagine the case:
```sql
SELECT RS_Width(RS_Downsample(RS_Rechunk(rasterCol))) FROM zarr_table
```
We never need the actual pixels read here so if we can keep this all lazy
it'll be a big win.
The `rasterCol` would be composed of RasterRefs that contain lazy
ZarrBandRefs.
### RS_Rechunk
Two possibilities:
1. The simple lazy approach: a lazy StridedBandRef encodes the subsetting
and points to the original ZarrBandRef. This reference to the original
ZarrBandRef is used when `nd_buffer` or `contiguous_data` is called.
2. A new ZarrBandRef that encodes the subsetting bounds into the outdb_uri.
This assumes the reader supports subsetting.
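A rough sketch of option 1, to make the has-a idea concrete. Everything here
is hypothetical: the `BandRef` trait is reduced to two methods, `FakeZarrBandRef`
is a toy stand-in that only counts reads, and the band is assumed to be 2-D
with one byte per pixel and a `[rows, cols]` shape. The real trait in this PR
may look quite different.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical, simplified stand-in for the BandRef trait.
trait BandRef {
    /// Dimension sizes, answered from metadata only.
    fn shape(&self) -> Vec<u64>;
    /// Materializes the pixels as a contiguous row-major buffer.
    fn contiguous_data(&self) -> Vec<u8>;
}

/// Toy out-of-db band that counts how often its pixels are read.
struct FakeZarrBandRef {
    shape: Vec<u64>,
    reads: Rc<Cell<usize>>,
}

impl BandRef for FakeZarrBandRef {
    fn shape(&self) -> Vec<u64> {
        self.shape.clone()
    }

    fn contiguous_data(&self) -> Vec<u8> {
        self.reads.set(self.reads.get() + 1);
        // Pretend to fetch: each pixel holds its flat index (mod 256).
        let n: u64 = self.shape.iter().product();
        (0..n).map(|i| i as u8).collect()
    }
}

/// Lazy 2-D window that has-a child band and only records the subsetting.
struct StridedBandRef {
    child: Box<dyn BandRef>,
    row_start: u64,
    col_start: u64,
    rows: u64,
    cols: u64,
}

impl BandRef for StridedBandRef {
    fn shape(&self) -> Vec<u64> {
        // No I/O: the window size is known up front.
        vec![self.rows, self.cols]
    }

    fn contiguous_data(&self) -> Vec<u8> {
        // Only now is the child asked for pixels.
        let child_cols = self.child.shape()[1] as usize;
        let data = self.child.contiguous_data();
        let width = self.cols as usize;
        let mut out = Vec::with_capacity(self.rows as usize * width);
        for r in 0..self.rows as usize {
            let start = (self.row_start as usize + r) * child_cols
                + self.col_start as usize;
            out.extend_from_slice(&data[start..start + width]);
        }
        out
    }
}
```

In this toy version the child still reads the whole array when data is finally
requested; option 2 would push the window down to the reader instead.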
### RS_Downsample
This one changes the dimensions of the dataset. An eager implementation is
pretty simple, but the composition approach allows us to continue to be
lazy.
A CoarsenBandRef implements `shape` from the metadata of its child BandRef.
It would only perform an actual coarsen if a data retrieval method is called.
In the coarsen case, being lazy with the child BandRef also can reduce peak
memory consumption because if the child can stream the input data, we never
hold both the original and result in memory at once.
### RS_Width
This is trivial - it returns a scalar so all it does is call the `shape`
method(s) of the input BandRef.
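A minimal sketch of how the coarsen and width pieces could stack, under
stated assumptions: the `BandRef` trait, `CountingBandRef`, the `factor`
field, and `rs_width` are all hypothetical names for illustration, the band
is 2-D with one byte per pixel, the shape is `[height, width]`, and the
coarsen is a plain block mean. It is not the implementation in this PR.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical, simplified stand-in for the BandRef trait.
trait BandRef {
    fn shape(&self) -> Vec<u64>;
    fn contiguous_data(&self) -> Vec<u8>;
}

/// Toy leaf band that counts how often its pixels are read.
struct CountingBandRef {
    shape: Vec<u64>,
    reads: Rc<Cell<usize>>,
}

impl BandRef for CountingBandRef {
    fn shape(&self) -> Vec<u64> {
        self.shape.clone()
    }

    fn contiguous_data(&self) -> Vec<u8> {
        self.reads.set(self.reads.get() + 1);
        // Each pixel holds its flat index (mod 256).
        let n: u64 = self.shape.iter().product();
        (0..n).map(|i| i as u8).collect()
    }
}

/// Lazy coarsen: derives its shape from the child's metadata.
struct CoarsenBandRef {
    child: Box<dyn BandRef>,
    factor: u64,
}

impl BandRef for CoarsenBandRef {
    fn shape(&self) -> Vec<u64> {
        // Metadata only; no pixels are touched.
        self.child.shape().iter().map(|d| d / self.factor).collect()
    }

    fn contiguous_data(&self) -> Vec<u8> {
        // The actual coarsen happens only when data is requested.
        let f = self.factor as usize;
        let child_cols = self.child.shape()[1] as usize;
        let out_shape = self.shape();
        let (rows, cols) = (out_shape[0] as usize, out_shape[1] as usize);
        let data = self.child.contiguous_data();
        let mut out = Vec::with_capacity(rows * cols);
        for r in 0..rows {
            for c in 0..cols {
                let mut sum = 0u32;
                for dr in 0..f {
                    for dc in 0..f {
                        sum += data[(r * f + dr) * child_cols + c * f + dc] as u32;
                    }
                }
                out.push((sum / (f * f) as u32) as u8);
            }
        }
        out
    }
}

/// Width lookup that only consults `shape` (assumes the last dimension is x).
fn rs_width(band: &dyn BandRef) -> u64 {
    *band.shape().last().expect("band has at least one dimension")
}
```

This toy coarsen pulls the whole child buffer at once, so it does not show
the streaming benefit; a child that can hand out chunks would be needed for
that.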
## Limitations
This stacking only works when there's no serialization (e.g. a shuffle)
between the functions we are stacking together. Serialization of this
composition is lossy.
When a shuffle occurs we need to compress all the stacking down into
something that can be represented in the Arrow metadata. I think these cases
are fairly limited and can be further reduced by using Deferred Expressions
(Dewey's idea), which is an optimization we can do sometime later.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]