Hi Stephen, > > I'm looking for efficient ways to subtotal a 1-d array onto a 2-D > grid. This > is more easily explained in code that words, thus: > > for n in xrange(len(data)): > totals[ i[n], j[n] ] += data[n] > > data comes from a series of PyTables files with ~200m rows. Each row > has ~20 > cols, and I use the first three columns (which are 1-3 char strings) > to form > the indexing functions i[] and j[], then want to calc averages of the > remaining 17 numerical cols. > > I have tried various indirect ways of doing this with searchsorted and > bincount, but intuitively they feel overly complex solutions to what > is > essentially a very simple problem. > > My work involved comparing the subtotals for various different > segmentation > strategies (the i[] and j[] indexing functions). Efficient solutions > are > important because I need to make many passes through the 200m rows of > data. > Memory usage is the easiest thing for me to adjust by changing how > many rows > of data to read in for each pass and then reusing the same array data > buffers. >
It looks as if the values in your i and j columns come from a limited range, so you may consider encoding pairs of (i,j) values into one int using a suitable encoding function (e.g. ij = i+K*j if both i and j are non-negative and K=max(i)+1). You could then use bincount(ij, data) to get the sums per encoded (i,j) pair. This should be efficient, and the complexity is only in the encoding/decoding steps. Best regards, Andreas ---- Dr. Andreas Eisele Senior Researcher DFKI GmbH, Language Technology Lab [EMAIL PROTECTED] Stuhlsatzenhausweg 3 tel: +49-681-302-5285 D-66123 Saarbrücken fax: +49-681-302-5338 _______________________________________________ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion