Re: [Numpy-discussion] Histograms of extremely large data sets

2006-12-14 Thread Cameron Walsh
Using Eric's latest speed-testing, here are David's results:
[EMAIL PROTECTED]:~/code_snippets/histogram$ python histogram_speed.py
type: uint8
millions of elements: 100.0
sec (C indexing based): 8.44 1
sec (numpy iteration based): 8.91 1
sec (rick's pure python): 6.4 1
sec (

Re: [Numpy-discussion] Histograms of extremely large data sets

2006-12-14 Thread David Huard
Hi, I spent some time a while ago on a histogram function for numpy. It uses digitize and bincount instead of sorting the data. If I remember right, it was significantly faster than numpy's histogram, but I don't know how it will behave with very large data sets. I attached the file if you want
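For readers without the attachment, the digitize-plus-bincount idea can be sketched roughly as follows. This is only an illustration, not David's attached code: the function name `histogram_digitize` and the choice of half-open bins are assumptions made here.

```python
import numpy as np

def histogram_digitize(data, edges):
    """Count values per bin with digitize + bincount, without sorting the data.

    Hypothetical helper for illustration. Bins are half-open,
    [edges[i], edges[i+1]), so a value equal to the last edge is not counted.
    """
    data = np.asarray(data).ravel()
    edges = np.asarray(edges)
    # digitize returns i such that edges[i-1] <= x < edges[i];
    # 0 means below the first edge, len(edges) means at or above the last edge.
    idx = np.digitize(data, edges)
    # One counter per possible index, including the two out-of-range slots.
    counts = np.bincount(idx, minlength=len(edges) + 1)
    # Drop the underflow and overflow slots, leaving len(edges) - 1 bins.
    return counts[1:len(edges)]
```

Whether this beats a sort-based approach on very large inputs would have to be measured; the sketch makes no claim either way.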

Re: [Numpy-discussion] Histograms of extremely large data sets

2006-12-14 Thread eric jones
I just noticed a bug in this code. "PyArray_ITER_NEXT(iter);" should be moved out of the if statement.
eric
eric jones wrote:
>
>
> Rick White wrote:
>> Just so we don't get too smug about the speed, if I do this in IDL
>> on the same machine it is 10 times faster (0.28 seconds instead of
>>

Re: [Numpy-discussion] Histograms of extremely large data sets

2006-12-14 Thread eric jones
Rick White wrote: Just so we don't get too smug about the speed, if I do this in IDL on the same machine it is 10 times faster (0.28 seconds instead of 4 seconds). I'm sure the IDL version uses the much faster approach of just sweeping through the array once, incrementing counts in the
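A single sweep of this kind can be sketched in plain Python as below. It is a minimal illustration assuming equal-width bins over [lo, hi); the name `sweep_histogram` and its parameters are made up for this example and do not come from the thread. A compiled loop (or `numpy.bincount` for small integer data) does the same work far faster; the Python loop only shows the logic.

```python
import numpy as np

def sweep_histogram(data, lo, hi, nbins):
    """Pass over the data once, incrementing one counter per element.

    Hypothetical helper: equal-width bins covering [lo, hi); values
    outside that range are ignored.
    """
    counts = [0] * nbins
    scale = nbins / float(hi - lo)
    # tolist() yields plain Python numbers, avoiding fixed-width integer
    # arithmetic on uint8 scalars inside the loop.
    for x in np.asarray(data).ravel().tolist():
        if lo <= x < hi:
            # Compute the bin index directly; no searching or sorting needed.
            i = min(int((x - lo) * scale), nbins - 1)
            counts[i] += 1
    return counts
```

The cost is one multiply and one increment per element, independent of the number of bins, which is why this approach tends to win when the bins are uniform.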

Re: [Numpy-discussion] Histograms of extremely large data sets

2006-12-14 Thread Brian Granger
This same idea could be used to parallelize the histogram computation. Then you could really get into large (many GB/TB/PB) data sets. I might try to find time to do this with ipython1, but someone else could do this as well.
Brian
On 12/13/06, Rick White <[EMAIL PROTECTED]> wrote:
> On Dec 12,
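One way to read the parallelization suggestion, assuming the idea referred to is splitting the data into independent blocks, is sketched below: each block is histogrammed separately against the same bin edges and the per-block counts are summed. The function name `chunked_histogram`, the use of threads, and the default chunk count are all assumptions for illustration; the thread mentions ipython1, which is not used here.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def chunked_histogram(data, edges, n_chunks=4, max_workers=4):
    """Histogram independent chunks against shared edges, then sum the counts.

    Hypothetical helper: because bin counts are additive, the chunks can be
    processed in any order, or on different workers or machines.
    """
    flat = np.asarray(data).ravel()
    chunks = np.array_split(flat, n_chunks)

    def count(chunk):
        # Every chunk uses the same explicit edges so the counts line up.
        return np.histogram(chunk, bins=edges)[0]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial = list(pool.map(count, chunks))
    return np.sum(partial, axis=0)
```

The same structure also works serially for data that does not fit in memory: read one block at a time, histogram it, and add the result to a running total.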

Re: [Numpy-discussion] Histograms of extremely large data sets

2006-12-14 Thread Rick White
On Dec 14, 2006, at 2:56 AM, Cameron Walsh wrote:
> At some point I might try and test
> different cache sizes for different data-set sizes and see what the
> effect is. For now, 65536 seems a good number and I would be happy to
> see this replace the current numpy.histogram.
I experimented a li

Re: [Numpy-discussion] Histograms of extremely large data sets

2006-12-13 Thread Cameron Walsh
Hi all, Absolutely gorgeous, I confirm the 1.6x speed-up over the weave version, i.e. a 25x speed-up over the existing version. It would be good if the redefinition of the range function could be changed in the numpy modules, before it goes into subversion, to avoid the need for Rick's line lran

Re: [Numpy-discussion] Histograms of extremely large data sets

2006-12-13 Thread eric jones
Looks to me like Rick's version is simpler and faster. It looks like it offers a speed-up of about 1.6 on my machine over the weave version. I believe this is because the sorting approach results in quite a few fewer compares than the algorithm I used. Very cool. I vote that his version go int
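A sort-based histogram of the general kind discussed here might look like the sketch below. It is not Rick's actual code, which is not reproduced in this snippet; the name `histogram_sorted` and the half-open bin convention are assumptions. After sorting, each bin edge needs only one binary search, so the number of edge comparisons grows with the number of bins rather than with the number of elements.

```python
import numpy as np

def histogram_sorted(data, edges):
    """Sort once, then locate each bin edge with a binary search.

    Hypothetical helper: bins are half-open, [edges[i], edges[i+1]).
    """
    srt = np.sort(np.asarray(data).ravel())
    edges = np.asarray(edges)
    # pos[i] is the number of elements strictly less than edges[i].
    pos = srt.searchsorted(edges, side="left")
    # The count in each bin is the difference between consecutive positions.
    return np.diff(pos)
```

The sort itself dominates the cost and needs a full copy of the data, which is one reason to process very large inputs in blocks.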

Re: [Numpy-discussion] Histograms of extremely large data sets

2006-12-13 Thread Rick White
On Dec 12, 2006, at 10:27 PM, Cameron Walsh wrote:
> I'm trying to generate histograms of extremely large datasets. I've
> tried a few methods, listed below, all with their own shortcomings.
> Mailing-list archive and google searches have not revealed any
> solutions.
The numpy.histogram functio

Re: [Numpy-discussion] Histograms of extremely large data sets

2006-12-13 Thread eric jones
Glad to hear it worked for you.
see ya,
eric
Cameron Walsh wrote:
> Thanks very much, Eric. That line fixed it for me, although I'm still
> not sure why it broke with the last line.
>
> Your weave_histogram works a charm and is around 16 times faster than
> any of the other options I've tried.

Re: [Numpy-discussion] Histograms of extremely large data sets

2006-12-13 Thread Cameron Walsh
Thanks very much, Eric. That line fixed it for me, although I'm still not sure why it broke with the last line. Your weave_histogram works a charm and is around 16 times faster than any of the other options I've tried. On my laptop it took 30 seconds to generate a histogram from 500 million numb

Re: [Numpy-discussion] Histograms of extremely large data sets

2006-12-13 Thread eric jones
Hmmm. Not sure. Change that line to this instead, which should work as well:
code = array_converter.declaration_code(self, templatize, inline)
Both work for me.
eric
Cameron Walsh wrote:
> On 13/12/06, Cameron Walsh <[EMAIL PROTECTED]> wrote:
>
>> On 13/12/06, eric jones <[EMAI

Re: [Numpy-discussion] Histograms of extremely large data sets

2006-12-13 Thread Cameron Walsh
On 13/12/06, Cameron Walsh <[EMAIL PROTECTED]> wrote:
> On 13/12/06, eric jones <[EMAIL PROTECTED]> wrote 290 lines of
> awesome code and a fantastic explanation:
>
> > Hey Cameron,
> >
> > I wrote a simple weave based histogram function that should work for
> > your problem. It should work for an

Re: [Numpy-discussion] Histograms of extremely large data sets

2006-12-13 Thread Cameron Walsh
On 13/12/06, eric jones <[EMAIL PROTECTED]> wrote 290 lines of awesome code and a fantastic explanation:
> Hey Cameron,
>
> I wrote a simple weave based histogram function that should work for
> your problem. It should work for any array input data type. The needed
> files (and a few tests and e

Re: [Numpy-discussion] Histograms of extremely large data sets

2006-12-12 Thread eric jones
Hey Cameron, I wrote a simple weave based histogram function that should work for your problem. It should work for any array input data type. The needed files (and a few tests and examples) are attached. Below is the output from the histogram_speed.py file attached. The test takes about 1

[Numpy-discussion] Histograms of extremely large data sets

2006-12-12 Thread Cameron Walsh
Hi all, I'm trying to generate histograms of extremely large datasets. I've tried a few methods, listed below, all with their own shortcomings. Mailing-list archive and google searches have not revealed any solutions.
Method 1:
import numpy
import matplotlib
data=numpy.empty((489,1000,1000),dt