Thanks Francesc, but I am already planning to read the data in blocks as you suggest. My question is rather how best to update the subtotals for each block in a vectorised way using numpy, rather than with a simplistic and slow element-by-element loop.
I can't use a simple sum(), as in your benchmark example or Greg's reply, because I need to do:

    for n in xrange(len(data)):
        totals[ i[n], j[n] ] += data[n]

and not:

    for n in xrange(len(data)):
        totals[n] += data[n]

My best solution so far is roughly like this:
 - read in the next block of 100k or so rows (taking into account the PyTables table's _v_maxTuples and _v_chunksize)
 - calculate the subtotal index arrays i and j
 - do a lexsort() on [i, j, n]
 - partition the sorted [i, j, n] into subsets where the i and j arrays change values; the k_th such subset is thus s_k = [ i_k, j_k, [n_k0, ..., n_kN] ]
 - update the subtotals for each subset in the block:
       totals[i_k, j_k] += sum(data[n_k0, ..., n_kN])

This should be reasonably efficient, but it's messy, and I'm not familiar enough with numpy's indexing tricks to get it right the first time. (A rough sketch of what I have in mind is at the bottom of this message, after the quoted benchmark code.)

Maybe instead I'll have a go at writing a Pyrex function that implements the simple loop at C speed:

    subtotal2d(data_array, idx_array, out=None, dtype=None)

where data_array is Nx1, idx_array is NxM and out is M-dimensional.

Incidentally, there's one other function I'd find useful here in forming the index arrays i[] and j[]: a fast translate-from-dict function, something like

    arr2 = fromiter( d[a] for a in arr1 )

My initial impression is that a C version would be substantially faster, but maybe I should do some benchmarking first to see whether a pure Python/numpy approach is faster than I expect.

Cheers, and thanks for any further suggestions,

Stephen


Francesc Altet <[EMAIL PROTECTED]> wrote:

> On Friday 29 December 2006 10:05, Stephen Simmons wrote:
> > Hi,
> >
> > I'm looking for efficient ways to subtotal a 1-d array onto a 2-D grid. This is more easily explained in code than words, thus:
> >
> >     for n in xrange(len(data)):
> >         totals[ i[n], j[n] ] += data[n]
> >
> > data comes from a series of PyTables files with ~200m rows. Each row has ~20 cols, and I use the first three columns (which are 1-3 char strings) to form the indexing functions i[] and j[], then want to calc averages of the remaining 17 numerical cols.
> >
> > I have tried various indirect ways of doing this with searchsorted and bincount, but intuitively they feel overly complex solutions to what is essentially a very simple problem.
> >
> > My work involves comparing the subtotals for various different segmentation strategies (the i[] and j[] indexing functions). Efficient solutions are important because I need to make many passes through the 200m rows of data. Memory usage is the easiest thing for me to adjust, by changing how many rows of data to read in for each pass and then reusing the same array data buffers.
>
> Well, from your words I guess you should already have tested this, but just in case: as PyTables saves data in tables row-wise, it is always faster to use the complete row for computations in each iteration than just a single column. This is shown in the small benchmark that I'm attaching at the end of the message. Here is its output for a table with 1m rows:
>
> time for creating the file--> 12.044
> time for using column reads --> 46.407
> time for using the row wise iterator--> 73.036
> time for using block reads (row wise)--> 5.156
>
> So, using block reads (in case you can use them) is your best bet.
>
> HTH,
>
> --------------------------------------------------------------------------------------
> import tables
> import numpy
> from time import time
>
> nrows = 1000*1000
>
> # Create a table definition with 17 double cols and 3 string cols
> coltypes = numpy.dtype("f8,"*17 + "S3,"*3)
>
> t1 = time()
> # Create a file with an empty table. Use compression to minimize file size.
> f = tables.openFile("/tmp/prova.h5", 'w')
> table = f.createTable(f.root, 'table', numpy.empty(0, coltypes),
>                       filters=tables.Filters(complevel=1, complib='lzo'))
> # Fill the table with default values (empty strings and zeros)
> row = table.row
> for nrow in xrange(nrows):
>     row.append()
> f.close()
> print "time for creating the file-->", round(time()-t1, 3)
>
> # *********** Start benchmarks **************************
> f = tables.openFile("/tmp/prova.h5", 'r')
> table = f.root.table
> colnames = table.colnames[:-3]  # exclude the string cols
>
> # Loop over the table using column reads
> t1 = time(); cum = numpy.zeros(17)
> for ncol, colname in enumerate(colnames):
>     col = table.read(0, nrows, field=colname)
>     cum[ncol] += col.sum()
> print "time for using column reads -->", round(time()-t1, 3)
>
> # Loop over the table using its row iterator
> t1 = time(); cum = numpy.zeros(17)
> for row in table:
>     for ncol, colname in enumerate(colnames):
>         cum[ncol] += row[colname]
> print "time for using the row iterator-->", round(time()-t1, 3)
>
> # Loop over the table using block reads (row wise)
> t1 = time(); cum = numpy.zeros(17)
> step = 10000
> for nrow in xrange(0, nrows, step):
>     ra = table[nrow:nrow+step]
>     for ncol, colname in enumerate(colnames):
>         cum[ncol] += ra[colname].sum()
> print "time for using block reads (row wise)-->", round(time()-t1, 3)
>
> f.close()
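
P.S. In case it's useful to anyone following along, here is roughly the numpy-only version I have in mind for one block. It's untested and only a sketch: the field names, the two lookup dicts and the shape of totals are placeholders for whatever the real table uses.

    import numpy

    def subtotal_block(block, data_field, key1_field, key2_field,
                       row_map, col_map, totals):
        # block is a record array read from the PyTables table,
        # e.g. table[n:n+100000]; row_map/col_map are dicts mapping the
        # 1-3 char key strings to row/column indices of the totals grid.
        if len(block) == 0:
            return totals
        data = block[data_field].astype(numpy.float64)

        # translate the string keys into integer indices (the
        # "fromiter from a dict" step mentioned above)
        i = numpy.fromiter((row_map[k] for k in block[key1_field]),
                           dtype=numpy.intp, count=len(block))
        j = numpy.fromiter((col_map[k] for k in block[key2_field]),
                           dtype=numpy.intp, count=len(block))

        # sort so that equal (i, j) pairs end up adjacent; lexsort treats
        # the last key as the primary one, so pass (j, i)
        order = numpy.lexsort((j, i))
        i, j, data = i[order], j[order], data[order]

        # mark the start of each run of identical (i, j) pairs
        changed = numpy.empty(len(i), dtype=bool)
        changed[0] = True
        changed[1:] = (i[1:] != i[:-1]) | (j[1:] != j[:-1])
        starts = numpy.flatnonzero(changed)

        # sum each run in one shot and scatter the sums into the grid;
        # the (i, j) pairs at the run starts are all distinct, so the
        # fancy-indexed += is safe here
        sums = numpy.add.reduceat(data, starts)
        totals[i[starts], j[starts]] += sums
        return totals

The idea would be to call this once per block as the rows are read in, with the real field names and lookup dicts substituted, so the totals grid accumulates across the full 200m rows.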
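
And since bincount came up in the original post: if i and j are already small non-negative integers, the sort/partition step could be skipped entirely by flattening each (i, j) pair into a single key and letting bincount's weights argument do the summing. Again only a sketch; it assumes totals is a 2-D float array and data, i, j are the per-block data and index arrays.

    import numpy

    def subtotal_block_bincount(data, i, j, totals):
        # flatten the (i, j) pairs into single indices into the totals grid
        n_rows, n_cols = totals.shape
        keys = i * n_cols + j
        # bincount with weights sums data for each distinct key in one pass
        sums = numpy.bincount(keys, weights=data)
        # bincount only returns bins up to max(keys), so add into the
        # flattened totals
        totals.flat[:len(sums)] += sums
        return totals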