Hi,

I have a large (global, 30m resolution, 50GB+) GeoTIFF dataset, from which I 
need to read many (millions) of pixel values at given input coordinates. I've 
got reasonable performance out of the code, about a million queries over five 
minutes, but:


  1.  There are actually twelve separate datasets of this size to query, not 
just one, so it takes approximately an hour.
  2.  This is by far the slowest portion of the program, and the users demand 
speed!
  3.  The users would also like to move towards higher resolution datasets, 
which we see run about 5x slower.
  4.  When querying the data on a particular piece of network storage mounted 
as part of the local filesystem, we see a slowdown approaching two orders of 
magnitude - bulk file copies off the network storage are reasonable, but each 
IO request shows a significant overhead (up to a second), and GDAL is sending 
one for each coordinate queried.

The implementation is in Python, directly calling down to GDAL. The short, 
long-running snippet of code which performs the actual queries the dataset, 
having converted real-world coordinates to pixels, is:

value_arrays = (
            raster_ds.ReadAsArray(
                        xoff=coord[0] - buffer_size,
                        yoff=coord[1] - buffer_size,
                        xsize=npix,
                        ysize=npix
            ) for coord in offsets
)

There are a few things that are probably worth noting:

  1.  It is not necessarily a single pixel that is being read - for each 
coordinate, the program may be asked to get all pixel values within a given 
radius (typically a couple of pixels), and use some function to summarise these 
into a single value (mean, median, ...). GDAL currently returns a numpy array 
for each query, which is passed to the user-specified function after the 
snippet above.
  2.  The dataset is made up of 2048x2048 LZW-Compressed tiles containing 
floating point data (essentially conforming to COG, but with no overviews), 
grouped together in a VRT (performance is identical with plain GeoTIFFs, 
though).

Multiprocessing has not been found to help - we actually lose throughput as the 
disk read head is moving back and forth constantly. Better hardware (especially 
SSDs) is known to help, but no one wants to pay for that.

We see no particular performance difference from setting 
GDAL_DISABLE_READDIR_ON_OPEN=TRUE, and GDAL_CACHEMAX is left at the default 5% 
(64GB+ RAM available).



Does the Python interface to GDAL provide a way to supply a large number of 
offsets and get blocks of pixels back, avoiding the need to come back up to 
Python after each query? (I suspect not)
Is there some way to optimise GDAL so that queries of files on the mounted 
network storage are more efficient?



Dr. Daniel Evans
Software Developer

Skype<sip:daniel.ev...@jbarisk.com>


T +44 (0) 1756 799919
www.jbarisk.com<http://www.jbarisk.com>

[Visit our website]<http://www.jbarisk.com> 
[http://www.jbagroup.co.uk/imgstore/JBA-Email-Sig-Icons-LINKEDIN.png]  [Follow 
us on Twitter] <https://twitter.com/jbarisk>
Our postal address and registered office is JBA Risk Management Limited, 1 
Broughton Park, Old Lane North, Broughton, Skipton, North Yorkshire BD23 3FD.

Find out more about us here: www.jbarisk.com<http://www.jbarisk.com/> and 
follow us on Twitter @JBARisk<http://twitter.com/JBARisk> and 
LinkedIn<https://www.linkedin.com/company/2370847?trk=tyah&trkInfo=clickedVertical%3Acompany%2CclickedEntityId%3A2370847%2Cidx%3A2-1-2%2CtarId%3A1447414259786%2Ctas%3AJBA%20RISK%20MANAGEMENT>

The JBA Group supports the JBA Trust.

All JBA Risk Management's email messages contain confidential information and 
are intended only for the individual(s) named. If you are not the named 
addressee you should not disseminate, distribute or copy this e-mail.
Please notify the sender immediately by email if you have received this email 
by mistake and delete this email from your system.


JBA Risk Management Limited is registered in England, company number 07732946, 
1 Broughton Park, Old Lane North, Broughton, Skipton, North Yorkshire, BD23 
3FD, Telephone: +441756799919


_______________________________________________
gdal-dev mailing list
gdal-dev@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/gdal-dev

Reply via email to