Dear Even,

Thank you for exploring the possibility of making GDAL more multidimensional. 
In the 10+ years I've lurked on this list I recall this coming up periodically. 
It's a hard problem; perhaps now will be the time to actually do it!

Over that period, I have worked with a lot of 3D (x,y,z and x,y,time) and 4D 
(x,y,z,time) datasets from the ocean remote sensing and modeling communities. 
I'm a marine spatial ecologist; most of my work involves trying to explain or 
predict distributions of marine species from observations or models of 
oceanographic variables such as sea surface temperature (SST) or chlorophyll 
concentration. I want to describe two of the most common scenarios involving 
multidimensional data in the hope that it can inform your thinking about ways 
to enhance GDAL's API to be multidimensional. Hopefully you'll find this 
useful; apologies if not!

Scenario 1: Sampling a 3D or 4D raster dataset with a vector dataset:

In this scenario, we have vector features, usually points, for which we want to 
obtain values from a 3D or 4D raster dataset. For example, the vector features 
might be points representing the locations of satellite-tracked animals, of 
survey vessels, or hauls of fishing gear. The vectors usually have a datetime 
attribute, and possibly a depth attribute. The raster dataset might be a time 
series of satellite SST images with dimensions lon, lat, and time, or a 
time+depth series of water temperature from a 4D ocean model with dimensions 
lon, lat, depth, and time. We need to match up each point to the closest 2D 
slice based on time and/or depth, and then index into the image with lon and 
lat and extract the value of the cell to an attribute of the point. From there 
we can perform statistical analysis on the sampled values.

Usually the vectors are points, but occasionally we may have lines or polygons 
over which we'd like to summarize the raster data, e.g. find the mean value of 
the raster inside each polygon or along a line. This is much more complicated 
and, in my experience, is much less commonly done than points. Also, just as 
"bilinear interpolation" might be used to interpolate a value from a point on a 
2D raster, we occasionally want to extend this to do "trilinear interpolation" 
on, say, an x,y,t image, so that the interpolated value is based on the 8 
surrounding cells (or 16 surrounding cells with x,y,z,t).

Scenario 2: Summarizing 3D or 4D raster datasets into 2D summary images:

In this scenario, we have a time series of images, either x,y,t or x,y,z,t, 
that occur at some high temporal frequency, e.g. one image per day, and we want 
to statistically summarize them at a lower frequency. For example, we might 
want to take daily SST images and create monthly averages (365 images/year --> 
12 images/year) or even monthly climatological averages (365 images/year * n 
years --> 12 rasters, one for each month averaged across all years).

Implications for GDAL's API:

These scenarios are very high level, and I would not expect GDAL's API to 
expose methods for performing them. These are geoprocessing operations that 
should be implemented at some higher layer that itself would make use of GDAL. 
So the question is: what should GDAL's API provide, if anything, to help 
facilitate implementation of the scenarios by that higher layer? Here are some 
things to consider.

1. Aggregating a series of low dimension datasets into a single virtual higher 
dimension dataset (e.g. 2D to 3D):

In the communities I work with, it is fairly rare for someone to release a 
single multidimensional file that spans more than one time slice. For example, 
it is rare for someone to release a single netCDF of SST with dimensions x,y,t 
with the length of t being 365, representing one year of daily images. Instead 
they will release 365 netCDFs, each containing a 2D dataset, or perhaps a 3D 
dataset with a single time slice. As new data are collected, they issue new 
files.

To use such data for scenarios 1 and 2 (and others), it is very convenient for 
these series of low dimension datasets to be aggregated into a single virtual 
high dimension dataset that can then be operated over. netCDF has long had 
special facilities for this, such as thredds/opendap and NcML <aggregation>. 
The GDAL API could expose a similar capability that would allow aggregation of 
arbitrary raster data in this way (rather than just netCDFs).

This capability would be a challenge to implement, no doubt, but it would 
provide a lot of value to users of the GDAL API. There could be some synergy 
with the .vrt driver. Particularly valuable would be the ability to point GDAL 
to a directory tree of datasets and have it aggregate them.

2. Special handling for aggregations of periodic datasets:

Many data providers produce datasets that have a regular periodicity. For 
example, they might release one image per day from a polar orbiting satellite. 
For these datasets, when performing calculations on the time dimension, it is 
convenient to assume that there will always be 365 time slices per year (or 366 
on leap years). But, sadly, data providers occasionally experience problems and 
sometimes do not release time slices. Now, suddenly, there are not 365 images 
with the year 2017, but only 362 because 3 were not produced. Users regularly 
get tripped up by this. For example, in the popular HYCOM ocean model, 
inquiries about this happen on their mailing list every few months, prompting 
the HYCOM team to prepare a multi-page FAQ just on this issue.

There are also datasets that could described as semi-regular. In the 
oceanographic community, it is common to have rasters that span 8 days. But 
these datasets often "start over" on 1 January every year, such that the first 
dataset spans 1-8 January, the second 9-16 January, etc. The 46th one usually 
spans days 361-365 (or 366) or maybe includes 2-3 days of the new year, but 
then the next year starts again at 1 January.  

So, assuming GDAL were capable of performing aggregations, it would be useful 
to describe to GDAL (or have it otherwise deduce) the periodicity of datasets 
and then itself generate missing time slices virtually. These slices would just 
be filled with the "no data" value. 

It would also be useful for the API provide an abstraction for the concept of 
periodicity, so that a caller can get/set it. I fully recognize this is 
difficult, in that there are so many ways that datasets can be periodic, and 
that it might not be possible to invent a scheme for representing them all. But 
there are many datasets that fall under common schemes (e.g. one image every 
day, 8-days, month, year, etc.) and this might be a situation in which a simple 
solution can cover 95% of the datasets out there and provide good value.

A lot of these issues about time have been considered before by others, e.g. by 
those developing netCDF metadata standards. This raises the question: should 
GDAL even consider getting involved in this area? Isn't it really tricky? Yes, 
probably. But if GDAL treats time completely generically, the new API might not 
provide that much value of what can already be done, e.g. with subdatasets. 
(But I do recognize that this is a potential minefield.)

3. Lazy evaluation:

If an aggregate dataset is spread across many files, it would be good if GDAL 
could avoid iterating over all of those files unless absolutely necessary. For 
example, it would be unfortunate if GDAL could only determine the time or depth 
coordinate of a file by opening it and reading a netCDF attribute. Consider a 
3D dataset of daily SST images composed of 10,000 netCDF files. It would be 
very slow if GDAL had to open each one in order to return an array of time 
dimension values. In some cases this might be unavoidable, but GDAL should 
provide the means to avoid it, if possible, e.g. by telling GDAL how to parse 
the times from the file names. You are probably well aware of these issues in 
developing the .vrt driver and similar projects.

I hope this helps. Thanks for your time and consideration.

Jason

> -----Original Message-----
> From: gdal-dev <gdal-dev-boun...@lists.osgeo.org> On Behalf Of Even
> Rouault
> Sent: Friday, May 24, 2019 11:01 AM
> To: gdal-dev@lists.osgeo.org
> Subject: [gdal-dev] "RFC 75: Multidimensional arrays" available for
> preliminary review
> 
> Hi,
> 
> I've prepared a preliminary version of a new RFC to add support for
> multidimensional arrays
> 
> https://urldefense.proofpoint.com/v2/url?u=https-
> 3A__github.com_rouault_gdal_blob_rfc75-
> 5Ftext_gdal_doc_source_development_rfc_rfc75-5Fmultidimensional-
> 5Farrays.rst&d=DwIGaQ&c=imBPVzF25OnBgGmVOlcsiEgHoG1i6YHLR0Sj_gZ4adc&r=cJfJ
> 4ejc1xbg_qb47Pg1OoRq1GfFGvWbDD2PT7-
> fBKk&m=z4DIC0ByQfvEPOoQULwLWJUCYSJ_rD9wnK56bSk-088&s=QoB-
> _CuLOE_hxDlvaGk9pf_1E6LVycCraOPSTwANhCk&e=
> 
> If you want to comment it inline, it is also available as a pull request
> at https://urldefense.proofpoint.com/v2/url?u=https-
> 3A__github.com_OSGeo_gdal_pull_1580&d=DwIGaQ&c=imBPVzF25OnBgGmVOlcsiEgHoG1
> i6YHLR0Sj_gZ4adc&r=cJfJ4ejc1xbg_qb47Pg1OoRq1GfFGvWbDD2PT7-
> fBKk&m=z4DIC0ByQfvEPOoQULwLWJUCYSJ_rD9wnK56bSk-
> 088&s=5ZZMNYskn0v4Jzk4QvfTKYSArMmQZEZTTiH3uxtCVAI&e=
> 
> This is preliminary content to give an idea of the general directions I
> have in mind for now.
> Not backed by any implementation yet (should start soon hopefully), so API
> details will likely change, but early feedback is welcome.
> 
> Even
> 
> --
> Spatialys - Geospatial professional services
> https://urldefense.proofpoint.com/v2/url?u=http-
> 3A__www.spatialys.com&d=DwIGaQ&c=imBPVzF25OnBgGmVOlcsiEgHoG1i6YHLR0Sj_gZ4a
> dc&r=cJfJ4ejc1xbg_qb47Pg1OoRq1GfFGvWbDD2PT7-
> fBKk&m=z4DIC0ByQfvEPOoQULwLWJUCYSJ_rD9wnK56bSk-088&s=vEHFxJLIKrMqTFFYqD_x-
> oa1BtTCt8ppUhHU1HakMoA&e=
> _______________________________________________
> gdal-dev mailing list
> gdal-dev@lists.osgeo.org
> https://urldefense.proofpoint.com/v2/url?u=https-
> 3A__lists.osgeo.org_mailman_listinfo_gdal-
> 2Ddev&d=DwIGaQ&c=imBPVzF25OnBgGmVOlcsiEgHoG1i6YHLR0Sj_gZ4adc&r=cJfJ4ejc1xb
> g_qb47Pg1OoRq1GfFGvWbDD2PT7-
> fBKk&m=z4DIC0ByQfvEPOoQULwLWJUCYSJ_rD9wnK56bSk-
> 088&s=tUoKHatLSi1xtfDf1EYAGTrYwf99aV9Ky_Sa95QU54w&e=
_______________________________________________
gdal-dev mailing list
gdal-dev@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/gdal-dev

Reply via email to