Dear Jason,

it's great to read your excellent writeup of situations and requirements! FYI, GDAL already
has a rasdaman driver, where rasdaman (www.rasdaman.org) is an Array Database doing all
you describe (plus a few more things). However, currently GDAL utilizes rasdaman only as a
simple file access mechanism. Instead, queries could be pushed down from GDAL - that might
be low-hanging fruit for adding true nD support to GDAL. With rasdaman there are 2
interfaces: rasql (no georeferencing, SQL style, with only syntactic deviations from the
ISO SQL/MDA datacube standard) and OGC WCS/WCPS (the OGC datacube access/analytics
standards, with space/time georeferencing).
Would some sort of collaboration be of interest?

cheers,
Peter

On 29.05.19 20:37, Jason Roberts wrote:
> Dear Even,
>
> Thank you for exploring the possibility of making GDAL more multidimensional.
> In the 10+ years I've lurked on this list I recall this coming up
> periodically. It's a hard problem; perhaps now will be the time to actually
> do it!
>
> Over that period, I have worked with a lot of 3D (x,y,z and x,y,time) and 4D
> (x,y,z,time) datasets from the ocean remote sensing and modeling communities.
> I'm a marine spatial ecologist; most of my work involves trying to explain or
> predict distributions of marine species from observations or models of
> oceanographic variables such as sea surface temperature (SST) or chlorophyll
> concentration. I want to describe two of the most common scenarios involving
> multidimensional data in the hope that it can inform your thinking about ways
> to enhance GDAL's API to be multidimensional. Hopefully you'll find this
> useful; apologies if not!
>
> Scenario 1: Sampling a 3D or 4D raster dataset with a vector dataset:
>
> In this scenario, we have vector features, usually points, for which we want
> to obtain values from a 3D or 4D raster dataset. For example, the vector
> features might be points representing the locations of satellite-tracked
> animals, of survey vessels, or of hauls of fishing gear. The vectors usually
> have a datetime attribute, and possibly a depth attribute. The raster dataset
> might be a time series of satellite SST images with dimensions lon, lat, and
> time, or a time+depth series of water temperature from a 4D ocean model with
> dimensions lon, lat, depth, and time. We need to match up each point to the
> closest 2D slice based on time and/or depth, and then index into the image
> with lon and lat and extract the value of the cell to an attribute of the
> point. From there we can perform statistical analysis on the sampled values.
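[Editor's note: a minimal sketch of the point-sampling step in scenario 1 above, using synthetic NumPy arrays rather than any real GDAL API; the cube, the coordinate vectors, and the `sample_point` helper are all invented for this illustration.]

```python
import numpy as np

# Hypothetical x,y,t cube: 10 daily SST slices on a 1-degree global grid.
times = np.arange("2019-01-01", "2019-01-11", dtype="datetime64[D]")
lons = np.arange(-179.5, 180.0, 1.0)   # cell centers
lats = np.arange(-89.5, 90.0, 1.0)
rng = np.random.default_rng(0)
sst = rng.uniform(0.0, 30.0, (times.size, lats.size, lons.size))

def sample_point(lon, lat, when):
    """Match the point to the closest 2D slice in time, then index by lon/lat."""
    t = np.argmin(np.abs(times - np.datetime64(when)))  # nearest time slice
    i = np.argmin(np.abs(lats - lat))                   # nearest cell center in lat
    j = np.argmin(np.abs(lons - lon))                   # ... and in lon
    return sst[t, i, j]

value = sample_point(-75.2, 35.1, "2019-01-04T13:30")
```

A real implementation would of course read the coordinate variables and pixel values through the driver rather than holding the whole cube in memory.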
>
> Usually the vectors are points, but occasionally we may have lines or
> polygons over which we'd like to summarize the raster data, e.g. find the
> mean value of the raster inside each polygon or along a line. This is much
> more complicated and, in my experience, much less commonly done than with
> points. Also, just as "bilinear interpolation" might be used to interpolate a
> value from a point on a 2D raster, we occasionally want to extend this to do
> "trilinear interpolation" on, say, an x,y,t image, so that the interpolated
> value is based on the 8 surrounding cells (or 16 surrounding cells with
> x,y,z,t).
>
> Scenario 2: Summarizing 3D or 4D raster datasets into 2D summary images:
>
> In this scenario, we have a time series of images, either x,y,t or x,y,z,t,
> that occur at some high temporal frequency, e.g. one image per day, and we
> want to statistically summarize them at a lower frequency. For example, we
> might want to take daily SST images and create monthly averages (365
> images/year --> 12 images/year) or even monthly climatological averages (365
> images/year * n years --> 12 rasters, one for each month averaged across all
> years).
>
> Implications for GDAL's API:
>
> These scenarios are very high level, and I would not expect GDAL's API to
> expose methods for performing them. These are geoprocessing operations that
> should be implemented at some higher layer that itself would make use of
> GDAL. So the question is: what should GDAL's API provide, if anything, to
> help facilitate implementation of the scenarios by that higher layer? Here
> are some things to consider.
>
> 1. Aggregating a series of low-dimension datasets into a single virtual
> higher-dimension dataset (e.g. 2D to 3D):
>
> In the communities I work with, it is fairly rare for someone to release a
> single multidimensional file that spans more than one time slice.
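[Editor's note: the scenario-2 style of reduction (365 daily images collapsed to 12 monthly means) can be sketched as follows; the `daily` stack and grid size are synthetic, and nodata handling is ignored for brevity.]

```python
import numpy as np

# Hypothetical daily x,y,t stack for one (non-leap) year on a tiny 4x4 grid.
rng = np.random.default_rng(0)
daily = rng.uniform(0.0, 30.0, (365, 4, 4))
days_per_month = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

# Collapse 365 daily images into 12 monthly mean images (365 --> 12).
monthly = []
start = 0
for n in days_per_month:
    monthly.append(daily[start:start + n].mean(axis=0))
    start += n
monthly = np.stack(monthly)   # shape (12, 4, 4)
```

With real data one would use a masked or NaN-aware mean so that missing cells do not bias the monthly average.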
> For example, it is rare for someone to release a single netCDF of SST with
> dimensions x,y,t with the length of t being 365, representing one year of
> daily images. Instead they will release 365 netCDFs, each containing a 2D
> dataset, or perhaps a 3D dataset with a single time slice. As new data are
> collected, they issue new files.
>
> To use such data for scenarios 1 and 2 (and others), it is very convenient
> for these series of low-dimension datasets to be aggregated into a single
> virtual high-dimension dataset that can then be operated over. netCDF has
> long had special facilities for this, such as THREDDS/OPeNDAP and NcML
> <aggregation>. The GDAL API could expose a similar capability that would
> allow aggregation of arbitrary raster data in this way (rather than just
> netCDFs).
>
> This capability would be a challenge to implement, no doubt, but it would
> provide a lot of value to users of the GDAL API. There could be some synergy
> with the .vrt driver. Particularly valuable would be the ability to point
> GDAL at a directory tree of datasets and have it aggregate them.
>
> 2. Special handling for aggregations of periodic datasets:
>
> Many data providers produce datasets that have a regular periodicity. For
> example, they might release one image per day from a polar-orbiting
> satellite. For these datasets, when performing calculations on the time
> dimension, it is convenient to assume that there will always be 365 time
> slices per year (or 366 in leap years). But, sadly, data providers
> occasionally experience problems and sometimes do not release time slices.
> Now, suddenly, there are not 365 images for the year 2017 but only 362,
> because 3 were not produced. Users regularly get tripped up by this. For
> example, in the popular HYCOM ocean model, inquiries about this happen on
> their mailing list every few months, prompting the HYCOM team to prepare a
> multi-page FAQ just on this issue.
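[Editor's note: one way an aggregation layer could shield users from such gaps is to materialize a complete time axis and substitute nodata slices for dates the provider never released. A minimal sketch with invented names (`NODATA`, `fill_gaps`) and synthetic slices:]

```python
import numpy as np
from datetime import date, timedelta

NODATA = -9999.0

# Hypothetical daily series with gaps: 3 of 10 expected slices were never
# produced by the provider (days 3, 5, and 8 are missing).
produced = {date(2019, 1, 1) + timedelta(days=d): np.full((2, 2), float(d))
            for d in range(10) if d not in (3, 5, 8)}

def fill_gaps(slices, start, ndays, shape):
    """Materialize a complete daily time axis, substituting a nodata slice
    for every date the provider never released."""
    return np.stack([
        slices.get(start + timedelta(days=d), np.full(shape, NODATA))
        for d in range(ndays)
    ])

cube = fill_gaps(produced, date(2019, 1, 1), 10, (2, 2))  # always 10 slices
```

Downstream code can then assume a fixed number of slices per period and treat the gaps as ordinary nodata.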
>
> There are also datasets that could be described as semi-regular. In the
> oceanographic community, it is common to have rasters that span 8 days. But
> these datasets often "start over" on 1 January every year, such that the
> first dataset spans 1-8 January, the second 9-16 January, etc. The 46th one
> usually spans days 361-365 (or 366), or maybe includes 2-3 days of the new
> year, but then the next year starts again at 1 January.
>
> So, assuming GDAL were capable of performing aggregations, it would be useful
> to describe to GDAL (or have it otherwise deduce) the periodicity of datasets
> and then have it generate missing time slices virtually. These slices would
> just be filled with the "no data" value.
>
> It would also be useful for the API to provide an abstraction for the concept
> of periodicity, so that a caller can get/set it. I fully recognize this is
> difficult, in that there are so many ways that datasets can be periodic, and
> that it might not be possible to invent a scheme for representing them all.
> But there are many datasets that fall under common schemes (e.g. one image
> every day, 8 days, month, year, etc.), and this might be a situation in which
> a simple solution can cover 95% of the datasets out there and provide good
> value.
>
> A lot of these issues about time have been considered before by others, e.g.
> by those developing netCDF metadata standards. This raises the question:
> should GDAL even consider getting involved in this area? Isn't it really
> tricky? Yes, probably. But if GDAL treats time completely generically, the
> new API might not provide that much value over what can already be done, e.g.
> with subdatasets. (But I do recognize that this is a potential minefield.)
>
> 3. Lazy evaluation:
>
> If an aggregate dataset is spread across many files, it would be good if GDAL
> could avoid iterating over all of those files unless absolutely necessary.
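[Editor's note: a minimal sketch of this lazy-evaluation idea, deriving the time axis from a filename pattern so that no file is touched until a slice is read; the `LazyStack` class and the `sst_YYYYMMDD.nc` naming convention are invented for this illustration.]

```python
import re
from datetime import date

class LazyStack:
    """Virtual 3D stack over many 2D files. The time axis is parsed from the
    file names (assumed pattern: sst_YYYYMMDD.nc), so no file is opened
    until a slice is actually read."""
    PATTERN = re.compile(r"sst_(\d{4})(\d{2})(\d{2})\.nc$")

    def __init__(self, filenames):
        self.filenames = sorted(filenames)
        self.opens = 0                          # count simulated file opens

    def times(self):
        # Cheap: only the names are parsed; nothing is opened.
        return [date(*map(int, self.PATTERN.search(n).groups()))
                for n in self.filenames]

    def read_slice(self, i):
        self.opens += 1                         # only now would we touch disk
        return self.filenames[i]                # stand-in for real pixel data

stack = LazyStack([f"sst_2019{m:02d}01.nc" for m in range(1, 13)])
axis = stack.times()                            # 12 dates, zero files opened
```

Enumerating a 10,000-file dataset this way costs only a directory listing and some regex matching, instead of 10,000 netCDF opens.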
> For example, it would be unfortunate if GDAL could only determine the time or
> depth coordinate of a file by opening it and reading a netCDF attribute.
> Consider a 3D dataset of daily SST images composed of 10,000 netCDF files: it
> would be very slow if GDAL had to open each one in order to return an array
> of time dimension values. In some cases this might be unavoidable, but GDAL
> should provide the means to avoid it where possible, e.g. by telling GDAL how
> to parse the times from the file names. You are probably well aware of these
> issues from developing the .vrt driver and similar projects.
>
> I hope this helps. Thanks for your time and consideration.
>
> Jason
>
>> -----Original Message-----
>> From: gdal-dev <gdal-dev-boun...@lists.osgeo.org> On Behalf Of Even
>> Rouault
>> Sent: Friday, May 24, 2019 11:01 AM
>> To: gdal-dev@lists.osgeo.org
>> Subject: [gdal-dev] "RFC 75: Multidimensional arrays" available for
>> preliminary review
>>
>> Hi,
>>
>> I've prepared a preliminary version of a new RFC to add support for
>> multidimensional arrays:
>>
>> https://github.com/rouault/gdal/blob/rfc75_text/gdal/doc/source/development/rfc/rfc75_multidimensional_arrays.rst
>>
>> If you want to comment on it inline, it is also available as a pull request
>> at https://github.com/OSGeo/gdal/pull/1580
>>
>> This is preliminary content to give an idea of the general directions I
>> have in mind for now.
>> Not backed by any implementation yet (should start soon hopefully), so API
>> details will likely change, but early feedback is welcome.
>>
>> Even
>>
>> --
>> Spatialys - Geospatial professional services
>> http://www.spatialys.com

--
Dr. Peter Baumann
- Professor of Computer Science, Jacobs University Bremen
  www.faculty.jacobs-university.de/pbaumann
  mail: p.baum...@jacobs-university.de
  tel: +49-421-200-3178, fax: +49-421-200-493178
- Executive Director, rasdaman GmbH Bremen (HRB 26793)
  www.rasdaman.com, mail: baum...@rasdaman.com
  tel: 0800-rasdaman, fax: 0800-rasdafax, mobile: +49-173-5837882
"Si forte in alienas manus oberraverit hec peregrina epistola incertis ventis
dimissa, sed Deo commendata, precamur ut ei reddatur cui soli destinata, nec
preripiat quisquam non sibi parata." (mail disclaimer, AD 1083)

_______________________________________________
gdal-dev mailing list
gdal-dev@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/gdal-dev