james-willis commented on code in PR #749:
URL: https://github.com/apache/sedona-db/pull/749#discussion_r3114302286


##########
rust/sedona-schema/src/raster.rs:
##########
@@ -55,32 +54,37 @@ impl RasterSchema {
         )))
     }
 
-    /// Individual band schema
+    /// Individual band schema — flattened N-D band with dimension metadata
     pub fn band_type() -> DataType {
         DataType::Struct(Fields::from(vec![
-            Field::new(column::METADATA, Self::band_metadata_type(), false),
-            Field::new(column::DATA, Self::band_data_type(), false),
+            Field::new(column::NAME, DataType::Utf8, true),
+            Field::new(column::DIM_NAMES, Self::dim_names_type(), false),
+            Field::new(column::SHAPE, Self::shape_type(), false),
+            Field::new(column::DATATYPE, DataType::UInt32, false),
+            Field::new(column::NODATA, DataType::Binary, true),
+            Field::new(column::STRIDES, Self::strides_type(), false),
+            Field::new(column::OFFSET, DataType::UInt64, false),
+            Field::new(column::OUTDB_URI, DataType::Utf8, true),
+            Field::new(column::DATA, DataType::BinaryView, false),

Review Comment:
   I intend to solve this issue by having has-a relationships between BandRef 
implementations. Imagine the case:
   ```sql
   SELECT RS_Width(RS_Downsample(RS_Rechunk(rasterCol))) FROM zarr_table
   ```
   
   This query never needs the actual pixels, so if we can keep the whole chain 
lazy it'll be a big win.
   
   The `rasterCol` would be composed of RasterRefs that contain lazy 
ZarrBandRefs.
   
   ### RS_Rechunk
   Two possibilities:
   1. The simple lazy approach: a lazy StridedBandRef encodes the subsetting 
and points to the original ZarrBandRef. This reference to the original 
ZarrBandRef is used when `nd_buffer` or `contiguous_data` is called.
   2. A new ZarrBandRef that encodes the subsetting bounds into the outdb_uri. 
This assumes the reader supports subsetting.
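   
   A minimal sketch of option 1, assuming a simplified 2-D, one-byte-per-pixel 
`BandRef` trait. The trait signature, the fields of both structs and the 
`reads` counter are stand-ins for illustration, not the API in this PR:

```rust
use std::cell::Cell;

// Hypothetical, simplified BandRef: 2-D, one byte per pixel.
trait BandRef {
    fn shape(&self) -> (usize, usize); // (rows, cols), metadata only
    fn contiguous_data(&self) -> Vec<u8>; // row-major pixels, triggers a read
}

// Stand-in for a lazy out-db Zarr band; `reads` counts simulated fetches.
struct ZarrBandRef {
    rows: usize,
    cols: usize,
    reads: Cell<usize>,
}

impl BandRef for ZarrBandRef {
    fn shape(&self) -> (usize, usize) {
        (self.rows, self.cols)
    }
    fn contiguous_data(&self) -> Vec<u8> {
        self.reads.set(self.reads.get() + 1);
        (0..self.rows * self.cols).map(|i| i as u8).collect()
    }
}

// Option 1: record the subset window and keep a reference to the child.
struct StridedBandRef<'a> {
    child: &'a dyn BandRef,
    row0: usize,
    col0: usize,
    rows: usize,
    cols: usize,
}

impl BandRef for StridedBandRef<'_> {
    fn shape(&self) -> (usize, usize) {
        (self.rows, self.cols) // no read needed
    }
    fn contiguous_data(&self) -> Vec<u8> {
        // Only now is the child asked for pixels.
        let (_, child_cols) = self.child.shape();
        let src = self.child.contiguous_data();
        let mut out = Vec::with_capacity(self.rows * self.cols);
        for r in self.row0..self.row0 + self.rows {
            let start = r * child_cols + self.col0;
            out.extend_from_slice(&src[start..start + self.cols]);
        }
        out
    }
}

// Returns (subset shape, reads after `shape`, subset pixels, reads after data).
fn demo_strided() -> ((usize, usize), usize, Vec<u8>, usize) {
    let zarr = ZarrBandRef { rows: 4, cols: 4, reads: Cell::new(0) };
    let sub = StridedBandRef { child: &zarr, row0: 1, col0: 1, rows: 2, cols: 2 };
    let shape = sub.shape();
    let reads_after_shape = zarr.reads.get();
    let data = sub.contiguous_data();
    (shape, reads_after_shape, data, zarr.reads.get())
}
```

   In this sketch `shape` never touches the child's pixels; the single read 
happens when `contiguous_data` is called.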
   
   ### RS_Downsample
   This one changes the dimensions of the dataset. An eager implementation is 
pretty simple, but the composition approach allows us to continue to be lazy.
   
   A CoarsenBandRef implements `shape` from the metadata of its child BandRef. 
It would only perform an actual coarsen if a data retrieval method is called. 
In the coarsen case, being lazy with the child BandRef can also reduce peak 
memory consumption: if the child can stream the input data, we never hold both 
the original and the result in memory at once.
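   
   A rough sketch of that idea, again assuming a simplified 2-D, 
one-byte-per-pixel `BandRef` trait. `InMemoryBand`, the `factor` field and the 
block-mean aggregation are assumptions made for illustration. For brevity the 
sketch materializes the whole child on retrieval; a streaming child would avoid 
that:

```rust
// Hypothetical, simplified BandRef: 2-D, one byte per pixel.
trait BandRef {
    fn shape(&self) -> (usize, usize); // (rows, cols), metadata only
    fn contiguous_data(&self) -> Vec<u8>; // row-major pixels
}

// Stand-in child band, in memory only to keep the sketch short.
struct InMemoryBand {
    rows: usize,
    cols: usize,
    data: Vec<u8>,
}

impl BandRef for InMemoryBand {
    fn shape(&self) -> (usize, usize) {
        (self.rows, self.cols)
    }
    fn contiguous_data(&self) -> Vec<u8> {
        self.data.clone()
    }
}

// Lazy coarsen: owns its child and a block size (assumed to divide both dims).
struct CoarsenBandRef<B: BandRef> {
    child: B,
    factor: usize,
}

impl<B: BandRef> BandRef for CoarsenBandRef<B> {
    // Derived from the child's metadata; no pixels are read.
    fn shape(&self) -> (usize, usize) {
        let (r, c) = self.child.shape();
        (r / self.factor, c / self.factor)
    }
    // The coarsen itself (block mean here) runs only on retrieval.
    fn contiguous_data(&self) -> Vec<u8> {
        let (_, child_cols) = self.child.shape();
        let (rows, cols) = self.shape();
        let src = self.child.contiguous_data();
        let f = self.factor;
        let mut out = Vec::with_capacity(rows * cols);
        for r in 0..rows {
            for c in 0..cols {
                let mut sum = 0u32;
                for dr in 0..f {
                    for dc in 0..f {
                        sum += src[(r * f + dr) * child_cols + c * f + dc] as u32;
                    }
                }
                out.push((sum / (f * f) as u32) as u8);
            }
        }
        out
    }
}

// Coarsens a 4x4 band holding 0..16 by a factor of 2.
fn demo_coarsen() -> ((usize, usize), Vec<u8>) {
    let child = InMemoryBand { rows: 4, cols: 4, data: (0u8..16).collect() };
    let coarse = CoarsenBandRef { child, factor: 2 };
    (coarse.shape(), coarse.contiguous_data())
}
```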
   
   ### RS_Width
   This is trivial - it returns a scalar so all it does is call the `shape` 
method(s) of the input BandRef.
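   
   Roughly, and assuming a simplified `BandRef` whose `shape` returns 
`(height, width)`, the width lookup could walk a stack of wrappers without any 
pixel access. `LazyLeaf`, `Coarsen` and `rs_width` are hypothetical names for 
this sketch; the leaf panics on data retrieval to make an accidental read 
visible:

```rust
// Hypothetical, simplified BandRef: shape is (height, width).
trait BandRef {
    fn shape(&self) -> (u64, u64);
    fn contiguous_data(&self) -> Vec<u8>;
}

// Stand-in for a lazy out-db band that only knows its metadata.
struct LazyLeaf {
    height: u64,
    width: u64,
}

impl BandRef for LazyLeaf {
    fn shape(&self) -> (u64, u64) {
        (self.height, self.width)
    }
    fn contiguous_data(&self) -> Vec<u8> {
        panic!("pixels were read; this sketch expects metadata-only access")
    }
}

// Lazy wrapper that has-a child and shrinks both dimensions by `factor`.
struct Coarsen {
    child: Box<dyn BandRef>,
    factor: u64,
}

impl BandRef for Coarsen {
    fn shape(&self) -> (u64, u64) {
        let (h, w) = self.child.shape();
        (h / self.factor, w / self.factor)
    }
    fn contiguous_data(&self) -> Vec<u8> {
        unimplemented!("aggregation omitted in this sketch")
    }
}

// The scalar function only needs the shape chain.
fn rs_width(band: &dyn BandRef) -> u64 {
    band.shape().1
}

// Two stacked wrappers over a 1024x2048 leaf.
fn demo_width() -> u64 {
    let leaf = LazyLeaf { height: 1024, width: 2048 };
    let inner = Coarsen { child: Box::new(leaf), factor: 2 };
    let outer = Coarsen { child: Box::new(inner), factor: 4 };
    rs_width(&outer)
}
```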
   
   ## Limitations
   This stacking only works when there's no serialization (e.g. a shuffle) 
between the functions we are stacking together. Serialization of this 
composition is lossy. 
   
   When a shuffle occurs we need to compress all the stacking down into 
something that can be represented in the arrow metadata. I think these cases 
are fairly limited and can be further reduced by using Deferred Expressions 
(Dewey's idea), which is an optimization we can do sometime later. 
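   
   One way to picture the collapse at a shuffle boundary, assuming the 
simplest fallback of materializing the stack: `FlatBand`, `Prefix` and 
`collapse` are hypothetical names, and `FlatBand` only loosely mirrors the 
`shape` and `data` fields of the band struct above.

```rust
// Hypothetical, simplified BandRef over an N-D shape.
trait BandRef {
    fn shape(&self) -> Vec<u64>;
    fn contiguous_data(&self) -> Vec<u8>;
}

// Flat, serializable form: nothing lazy is left inside it.
#[derive(Debug, PartialEq)]
struct FlatBand {
    shape: Vec<u64>,
    data: Vec<u8>,
}

impl BandRef for FlatBand {
    fn shape(&self) -> Vec<u64> {
        self.shape.clone()
    }
    fn contiguous_data(&self) -> Vec<u8> {
        self.data.clone()
    }
}

// A toy lazy wrapper: keeps the first `len` elements of a 1-D child.
struct Prefix {
    child: Box<dyn BandRef>,
    len: u64,
}

impl BandRef for Prefix {
    fn shape(&self) -> Vec<u64> {
        vec![self.len.min(self.child.shape()[0])]
    }
    fn contiguous_data(&self) -> Vec<u8> {
        let mut data = self.child.contiguous_data();
        data.truncate(self.len as usize);
        data
    }
}

// Evaluate the whole stack once; the wrapper structure is lost afterwards.
fn collapse(band: &dyn BandRef) -> FlatBand {
    FlatBand { shape: band.shape(), data: band.contiguous_data() }
}

fn demo_collapse() -> FlatBand {
    let base = FlatBand { shape: vec![4], data: vec![1, 2, 3, 4] };
    let lazy = Prefix { child: Box::new(base), len: 2 };
    collapse(&lazy)
}
```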


