Thanks for your responses. Because of the size of the dataset I will still end up with a memory error if I calculate the median for each file, and the files are not all the same size. I believe the memory problem will also arise with the cumulative distribution calculation, and I am not sure I understand how to write the second suggestion about the iterative approach, but I will have a go. Thanks again.
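For reference, a minimal sketch of what that second, iterative suggestion could look like, done here as a bisection on the value axis so that only one file is ever in memory at a time. The path list `ncfiles`, the variable name 'T_SFC', and the tolerance are illustrative assumptions, not code from the thread:

    import numpy as N
    from netCDF4 import Dataset

    def count_below(ncfiles, threshold):
        # One streaming pass: count values strictly below `threshold`
        # and the total number of unmasked values across all files.
        below, total = 0, 0
        for path in ncfiles:
            nc = Dataset(path, 'r')
            data = N.ma.compressed(nc.variables['T_SFC'][:])
            nc.close()
            below += (data < threshold).sum()
            total += data.size
        return below, total

    def percentile_by_bisection(ncfiles, q=0.95, tol=0.001):
        # First pass: bracket the answer with the global min and max.
        lo, hi = N.inf, -N.inf
        for path in ncfiles:
            nc = Dataset(path, 'r')
            data = N.ma.compressed(nc.variables['T_SFC'][:])
            nc.close()
            lo, hi = min(lo, data.min()), max(hi, data.max())
        # Narrow the bracket until it is tighter than `tol`; each
        # step costs one more walk through the files.
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            below, total = count_below(ncfiles, mid)
            if below < q * total:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

The bracket halves on every iteration, so on the order of 30-40 passes over the files should pin the percentile down to about a thousandth of the data range, with memory use bounded by the largest single file.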
On Wed, Jan 25, 2012 at 1:26 PM, Brett Olsen <brett.ol...@gmail.com> wrote:
> On Tue, Jan 24, 2012 at 6:22 PM, questions anon
> <questions.a...@gmail.com> wrote:
> > I need some help understanding how to loop through many arrays to
> > calculate the 95th percentile.
> > I can easily do this by using numpy.concatenate to make one big array
> > and then finding the 95th percentile using numpy.percentile, but this
> > causes a memory error when I want to run it on 100's of netcdf files
> > (see code below).
> > Any alternative methods will be greatly appreciated.
> >
> > all_TSFC = []
> > for (path, dirs, files) in os.walk(MainFolder):
> >     for dir in dirs:
> >         print dir
> >     for ncfile in files:
> >         if ncfile[-3:] == '.nc':
> >             print "dealing with ncfiles:", ncfile
> >             ncfile = os.path.join(path, ncfile)
> >             ncfile = Dataset(ncfile, 'r', 'NETCDF4')
> >             TSFC = ncfile.variables['T_SFC'][:]
> >             ncfile.close()
> >             all_TSFC.append(TSFC)
> >
> > big_array = N.ma.concatenate(all_TSFC)
> > Percentile95th = N.percentile(big_array, 95, axis=0)
>
> If the range of your data is known and limited (i.e., you have a
> comparatively small number of possible values, but a number of repeats
> of each value) then you could do this by keeping a running cumulative
> distribution function as you go through each of your files. For each
> file, calculate a cumulative distribution function --- at each possible
> value, record the fraction of that population strictly less than that
> value --- and then it's straightforward to combine the cumulative
> distribution functions from two separate files:
>
>     cumdist_both = (cumdist1 * N1 + cumdist2 * N2) / (N1 + N2)
>
> Then once you've gone through all the files, look for the value where
> your cumulative distribution function is equal to 0.95. If your data
> isn't structured with repeated values, though, this won't work, because
> your cumulative distribution function will become too big to hold in
> memory. In that case, what I would probably do would be an iterative
> approach: make an approximation to the exact function by removing some
> fraction of the possible values, which will provide a limited range for
> the exact percentile you want, and then walk through the files again,
> calculating the function more exactly within the limited range,
> repeating until you have the value to the desired precision.
>
> ~Brett
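A minimal sketch of that running-distribution idea, using a shared histogram grid in place of exact repeated values; the bin edges, the path list `ncfiles`, and the variable name 'T_SFC' are assumptions for illustration, not part of the original posts:

    import numpy as N
    from netCDF4 import Dataset

    # Assumed global value range and resolution; the grid must cover
    # every file, so choose it from known physical limits of T_SFC.
    bins = N.linspace(-50.0, 60.0, 1101)
    counts = N.zeros(len(bins) - 1)

    for path in ncfiles:          # ncfiles: list of netCDF file paths
        nc = Dataset(path, 'r')
        data = N.ma.compressed(nc.variables['T_SFC'][:])
        nc.close()
        counts += N.histogram(data, bins=bins)[0]

    # Summing per-file histograms is the same weighted combination as
    # cumdist_both = (cumdist1*N1 + cumdist2*N2) / (N1 + N2), since a
    # file's counts are its distribution scaled by its size.
    cdf = N.cumsum(counts) / counts.sum()
    idx = N.searchsorted(cdf, 0.95)   # first bin where the CDF reaches 0.95
    print "approx. 95th percentile:", bins[idx + 1]

Only the `counts` array is kept between files, so memory use is fixed by the grid size, and the result is accurate to one bin width.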