james-willis opened a new issue, #746:
URL: https://github.com/apache/sedona-db/issues/746

   ### Problem
   
   SedonaDB's raster type models data as 2D spatial grids (width × height) with 
bands as a flat list. This can't represent multi-dimensional
   geospatial datasets — climate models with time dimensions, hyperspectral 
imagery, atmospheric profiles with pressure levels, or Zarr/NetCDF datacubes. 
Users must flatten into 2D+band (losing semantics) or leave SedonaDB.
   
   ### Approach
   
   Upgrade each band's data from a 2D tile to an N-D chunk with named 
dimensions and shape. The band/variable structure is preserved — a Zarr 
variable or GeoTIFF band maps to a band, but each band can now have shape 
`[time=12, y=256, x=256]` instead of just `[y=256, x=256]`. Legacy rasters load 
as bands with shape `[y, x]` — zero change for existing workloads.
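
   As a rough illustration of the band model described above, the sketch below pairs dimension names with a shape and treats a legacy raster as the `[y, x]` special case. `BandShape`, `legacy_2d`, and `num_elements` are hypothetical names used only for this example; they are not part of the proposed API.

   ```rust
   /// Hypothetical helper: dimension names paired with sizes for one band.
   #[derive(Debug, Clone, PartialEq)]
   pub struct BandShape {
       pub dim_names: Vec<String>,
       pub shape: Vec<u64>,
   }

   impl BandShape {
       /// Build from (name, size) pairs, outermost dimension first.
       pub fn new(dims: &[(&str, u64)]) -> Self {
           BandShape {
               dim_names: dims.iter().map(|(n, _)| n.to_string()).collect(),
               shape: dims.iter().map(|(_, s)| *s).collect(),
           }
       }

       /// A legacy 2D raster band is the `[y, x]` special case.
       pub fn legacy_2d(height: u64, width: u64) -> Self {
           Self::new(&[("y", height), ("x", width)])
       }

       /// Size of a named dimension, if the band has it.
       pub fn dim_size(&self, name: &str) -> Option<u64> {
           self.dim_names
               .iter()
               .position(|n| n == name)
               .map(|i| self.shape[i])
       }

       /// Total number of elements in the chunk.
       pub fn num_elements(&self) -> u64 {
           self.shape.iter().product()
       }
   }
   ```

   With this shape, `BandShape::new(&[("time", 12), ("y", 256), ("x", 256)])` and `BandShape::legacy_2d(256, 256)` go through the same code path, which is the point of loading legacy rasters as `[y, x]` bands.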
   
   ### Key decisions
   
   1. **Band = variable, each band is N-D** — Zarr's `temperature`, `pressure`, 
`wind_u` become 3 bands, each an N-D chunk. GeoTIFF bands map directly. Band 
math (`in[0]`, `in[1]`) unchanged.
   
   2. **Named dimensions per band** — Each band stores `dim_names` + `shape`. 
`y`/`x` (or `lat`/`lon`) are the spatial axes with hard-coded meaning; all 
others (`time`, `wavelength`, `pressure`, ...) are arbitrary. Bands may have 
different dimension sets but must agree on shared dimension sizes.
   
   3. **`RS_DimToBand` / `RS_BandToDim`** — Bridge between "everything is a 
dimension" (Zarr) and the band model. `RS_DimToBand(raster, 'wavelength')` 
promotes a within-band dimension into separate bands so standard band math 
works.
   
   4. **Two execution paths** — Native impls for metadata, coordinate 
conversion, predicates, and new N-D functions. GDAL-backed impls for
   compute-heavy spatial ops (clip, zonal stats, map algebra) — these extract 
y/x slices and operate per spatial slice.
   
   5. **Single schema** — Legacy 2D schema retired. All loaders produce N-D 
layout directly. No runtime schema detection needed.
   
   6. **Trait-based band storage** — `NdBandRef` trait with `nd_buffer()` 
(returns raw buffer + shape + strides for zero-copy access) and
   `contiguous_data()` (flat bytes, copies only if strided). Implementations: 
`InMemoryBand` (Phase 1), `ZarrBand` + `LazySlicedBand` (Phase 2), 
`GeoTiffBand` (Phase 2/3). Strided views are just `InMemoryBand` with 
non-standard strides — Arrow BinaryView refcounting handles lifetime.
   
   7. **Affine transform** — Single `transform: List<Float64>` (GDAL 
GeoTransform convention) at raster level. Applies to y/x dims only.
   
   8. **OutDb references** — Single `outdb_uri` field per band with 
scheme-based dispatch (`zarr://...`, `geotiff://...`).
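
   Decision 2 requires that bands agree on the sizes of any dimensions they share. A minimal sketch of such a check follows; `check_shared_dims` and its tuple-based input are assumptions made for illustration, not the proposed implementation.

   ```rust
   use std::collections::HashMap;

   /// Hypothetical validator: each band is (dim_names, shape).
   /// Returns the merged name -> size map, or an error describing the first conflict.
   pub fn check_shared_dims(
       bands: &[(Vec<&str>, Vec<u64>)],
   ) -> Result<HashMap<String, u64>, String> {
       let mut seen: HashMap<String, u64> = HashMap::new();
       for (i, (names, shape)) in bands.iter().enumerate() {
           if names.len() != shape.len() {
               return Err(format!("band {}: dim_names and shape differ in length", i));
           }
           for (name, &size) in names.iter().zip(shape) {
               // Copy the stored size out so the map can be mutated below.
               match seen.get(*name).copied() {
                   Some(prev) if prev != size => {
                       return Err(format!(
                           "band {}: dimension '{}' has size {}, expected {}",
                           i, name, size, prev
                       ));
                   }
                   Some(_) => {}
                   None => {
                       seen.insert(name.to_string(), size);
                   }
               }
           }
       }
       Ok(seen)
   }
   ```

   A `[time=12, y=256, x=256]` band next to a `[y=256, x=256]` band passes, while two bands that disagree on `y` are rejected.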
   
   ### Arrow schema
   ```rust
   Struct {
     crs:       Utf8View,
     transform: List<Float64>,     -- [origin_x, scale_x, skew_x, origin_y, skew_y, scale_y]
     bands: List<Struct {
       name:      Utf8,            -- e.g. "temperature" (nullable)
       dim_names: List<Utf8>,      -- ["time", "y", "x"]
       shape:     List<UInt64>,    -- [12, 256, 256]
       data_type: UInt32,
       nodata:    Binary,
       strides:   List<Int64>,     -- per-dim byte strides
       offset:    UInt64,
       outdb_uri: Utf8,            -- "zarr://s3://bucket/store#var/0.0.0" (nullable)
       data:      BinaryView,      -- row-major N-D array (eager) or empty (lazy)
     }>
   }
   ```
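
   To make the `transform` field concrete, here is a small sketch of pixel-to-world conversion using the six coefficients in the order listed in the schema comment. The function names are hypothetical, and the sketch assumes pixel `(0, 0)` maps to the origin, as in GDAL's GeoTransform convention.

   ```rust
   /// Map a (col, row) pixel position to world (x, y).
   /// `t` = [origin_x, scale_x, skew_x, origin_y, skew_y, scale_y].
   pub fn pixel_to_world(t: &[f64; 6], col: f64, row: f64) -> (f64, f64) {
       let x = t[0] + col * t[1] + row * t[2];
       let y = t[3] + col * t[4] + row * t[5];
       (x, y)
   }

   /// Inverse mapping; returns None when the transform is not invertible.
   pub fn world_to_pixel(t: &[f64; 6], x: f64, y: f64) -> Option<(f64, f64)> {
       let det = t[1] * t[5] - t[2] * t[4];
       if det == 0.0 {
           return None;
       }
       let dx = x - t[0];
       let dy = y - t[3];
       let col = (dx * t[5] - dy * t[2]) / det;
       let row = (dy * t[1] - dx * t[4]) / det;
       Some((col, row))
   }
   ```

   Because the transform applies to the y/x dims only, a band with shape `[time, y, x]` would reuse the same mapping for every time step.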
   ### Core traits
   
   ```rust
   pub struct NdBuffer<'a> {
       pub buffer: &'a [u8],
       pub shape: &'a [u64],
       pub strides: &'a [i64],
       pub offset: u64,
       pub data_type: BandDataType,
   }
   
   pub trait NdBandRef {
       fn ndim(&self) -> usize;
       fn dim_names(&self) -> &[&str];
       fn shape(&self) -> &[u64];
       fn dim_size(&self, name: &str) -> Option<u64>;
       fn data_type(&self) -> BandDataType;
       fn nodata(&self) -> Option<&[u8]>;
   
       /// Raw buffer + strides — for zero-copy consumers (numpy, Arrow FFI).
       /// Triggers load for lazy impls.
       fn nd_buffer(&self) -> Result<NdBuffer<'_>>;
   
       /// Contiguous row-major bytes — copies only if strides are non-standard.
       /// Most RS_* functions use this and never think about strides.
       fn contiguous_data(&self) -> Result<Cow<'_, [u8]>>;
   }
   ```
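
   The "copies only if strided" behaviour of `contiguous_data()` could look roughly like the sketch below. It is a standalone illustration, not the proposed implementation: `NdView` is a simplified stand-in for `NdBuffer` in which a hypothetical `elem_size` field replaces `BandDataType`, and errors are left out.

   ```rust
   use std::borrow::Cow;

   /// Simplified stand-in for `NdBuffer` (assumption: `elem_size` in bytes).
   pub struct NdView<'a> {
       pub buffer: &'a [u8],
       pub shape: &'a [u64],
       pub strides: &'a [i64], // byte strides per dimension
       pub offset: u64,        // byte offset of element [0, 0, ...]
       pub elem_size: usize,
   }

   /// Standard row-major byte strides for a shape.
   pub fn row_major_strides(shape: &[u64], elem_size: usize) -> Vec<i64> {
       let mut strides = vec![0i64; shape.len()];
       let mut acc = elem_size as i64;
       for i in (0..shape.len()).rev() {
           strides[i] = acc;
           acc *= shape[i] as i64;
       }
       strides
   }

   /// Borrow when the view is already row-major; otherwise gather into a new Vec.
   pub fn contiguous_data<'a>(v: &NdView<'a>) -> Cow<'a, [u8]> {
       let buf: &'a [u8] = v.buffer;
       let count: u64 = v.shape.iter().product();
       let total = count as usize * v.elem_size;
       if v.strides == row_major_strides(v.shape, v.elem_size).as_slice() {
           let start = v.offset as usize;
           return Cow::Borrowed(&buf[start..start + total]);
       }
       let mut out = Vec::with_capacity(total);
       let mut idx = vec![0u64; v.shape.len()];
       for _ in 0..count {
           // Byte position of the current element.
           let rel: i64 = idx.iter().zip(v.strides).map(|(&i, &s)| i as i64 * s).sum();
           let pos = (v.offset as i64 + rel) as usize;
           out.extend_from_slice(&buf[pos..pos + v.elem_size]);
           // Advance the index like an odometer, last dimension fastest.
           for d in (0..idx.len()).rev() {
               idx[d] += 1;
               if idx[d] < v.shape[d] {
                   break;
               }
               idx[d] = 0;
           }
       }
       Cow::Owned(out)
   }
   ```

   A transposed view of a 2×3 byte array (shape `[3, 2]`, strides `[1, 3]`) takes the copying branch, while the original layout is returned as a borrow.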
   ### Phases
   
   Phase 1 (this issue): N-D schema, NdRasterRef/NdBandRef traits, 
InMemoryBand, reimplement all 33 SedonaDB RS_* functions against traits, new 
N-D functions (RS_NumDimensions, RS_DimNames, RS_DimSize, RS_Shape, RS_Slice, 
RS_DimToBand, RS_BandToDim). Strides always contiguous. Crates: sedona-schema, 
sedona-raster, sedona-raster-functions.
   
   Phase 2: Zarr I/O. Add ZarrBand (lazy load on first access) and 
LazySlicedBand (wraps lazy band + slice spec so RS_DimToBand stays lazy). 
Chunk-level caching inside impls. `RS_NormalizedDifference(RS_DimToBand(data, 
'wavelength'), 77, 54)` loads only the chunks for wavelengths 77 and 54.
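
   To show what that expression computes, here is an eager, in-memory sketch: one dimension is split into separate bands, then two of them are combined. `dim_to_band` and `normalized_difference` are hypothetical stand-ins for the SQL functions. The 0-based indices and the `(a - b) / (a + b)` operand order are assumptions. Unlike the lazy Phase 2 path, this sketch materializes every band.

   ```rust
   /// Split a contiguous row-major array along `axis`, returning one flat
   /// buffer per index of that axis (eager stand-in for RS_DimToBand).
   pub fn dim_to_band(data: &[f32], shape: &[usize], axis: usize) -> Vec<Vec<f32>> {
       assert!(axis < shape.len(), "axis out of range");
       assert_eq!(data.len(), shape.iter().product::<usize>(), "shape/data mismatch");
       let outer: usize = shape[..axis].iter().product();
       let n = shape[axis];
       let inner: usize = shape[axis + 1..].iter().product();
       (0..n)
           .map(|k| {
               let mut band = Vec::with_capacity(outer * inner);
               for o in 0..outer {
                   // Start of the contiguous run for outer index `o`, axis index `k`.
                   let start = (o * n + k) * inner;
                   band.extend_from_slice(&data[start..start + inner]);
               }
               band
           })
           .collect()
   }

   /// Element-wise (a - b) / (a + b); NaN where the denominator is zero.
   pub fn normalized_difference(a: &[f32], b: &[f32]) -> Vec<f32> {
       a.iter()
           .zip(b)
           .map(|(&x, &y)| {
               let sum = x + y;
               if sum == 0.0 { f32::NAN } else { (x - y) / sum }
           })
           .collect()
   }
   ```

   A lazy implementation would instead record only the requested indices and fetch the matching chunks on first access.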
   
   Phase 3: N-D aggregations (reduce along a dimension), coordinate label 
arrays, dimension algebra.

