On Tue, Jun 21, 2011 at 11:17 AM, Keith Goodman <[email protected]> wrote:
> On Tue, Jun 21, 2011 at 9:46 AM, Zachary Pincus <[email protected]> wrote:
> > Hello all,
> >
> > As a result of the "fast greyscale conversion" thread, I noticed an
> > anomaly with numpy.ndarray.sum(): summing along certain axes is much
> > slower with sum() than doing it explicitly, but only with integer
> > dtypes and when the size of the dtype is less than the machine word.
> > I checked in 32-bit and 64-bit modes and in both cases only once the
> > dtype got as large as that did the speed difference go away. See
> > below...
> >
> > Is this something to do with numpy or something inexorable about
> > machine / memory architecture?
> >
> > Zach
> >
> > Timings -- 64-bit mode:
> > ----------------------
> > In [2]: i = numpy.ones((1024,1024,4), numpy.int8)
> > In [3]: timeit i.sum(axis=-1)
> > 10 loops, best of 3: 131 ms per loop
> > In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> > 100 loops, best of 3: 2.57 ms per loop
> >
> > In [5]: i = numpy.ones((1024,1024,4), numpy.int16)
> > In [6]: timeit i.sum(axis=-1)
> > 10 loops, best of 3: 131 ms per loop
> > In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> > 100 loops, best of 3: 4.75 ms per loop
> >
> > In [8]: i = numpy.ones((1024,1024,4), numpy.int32)
> > In [9]: timeit i.sum(axis=-1)
> > 10 loops, best of 3: 131 ms per loop
> > In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> > 100 loops, best of 3: 6.37 ms per loop
> >
> > In [11]: i = numpy.ones((1024,1024,4), numpy.int64)
> > In [12]: timeit i.sum(axis=-1)
> > 100 loops, best of 3: 16.6 ms per loop
> > In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> > 100 loops, best of 3: 15.1 ms per loop
> >
> >
> > Timings -- 32-bit mode:
> > ----------------------
> > In [2]: i = numpy.ones((1024,1024,4), numpy.int8)
> > In [3]: timeit i.sum(axis=-1)
> > 10 loops, best of 3: 138 ms per loop
> > In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> > 100 loops, best of 3: 3.68 ms per loop
> >
> > In [5]: i = numpy.ones((1024,1024,4), numpy.int16)
> > In [6]: timeit i.sum(axis=-1)
> > 10 loops, best of 3: 140 ms per loop
> > In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> > 100 loops, best of 3: 4.17 ms per loop
> >
> > In [8]: i = numpy.ones((1024,1024,4), numpy.int32)
> > In [9]: timeit i.sum(axis=-1)
> > 10 loops, best of 3: 22.4 ms per loop
> > In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> > 100 loops, best of 3: 12.2 ms per loop
> >
> > In [11]: i = numpy.ones((1024,1024,4), numpy.int64)
> > In [12]: timeit i.sum(axis=-1)
> > 10 loops, best of 3: 29.2 ms per loop
> > In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> > 10 loops, best of 3: 23.8 ms per loop
>
> One difference is that i.sum() changes the output dtype of int input
> when the int dtype is less than the default int dtype:
>
>    >> i.dtype
>    dtype('int32')
>    >> i.sum(axis=-1).dtype
>    dtype('int64') # <-- dtype changed
>    >> (i[...,0]+i[...,1]+i[...,2]+i[...,3]).dtype
>    dtype('int32')
>
> Here are my timings
>
>    >> i = numpy.ones((1024,1024,4), numpy.int32)
>    >> timeit i.sum(axis=-1)
>    1 loops, best of 3: 278 ms per loop
>    >> timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
>    100 loops, best of 3: 12.1 ms per loop
>    >> import bottleneck as bn
>    >> timeit bn.func.nansum_3d_int32_axis2(i)
>    100 loops, best of 3: 8.27 ms per loop
>
> Does making an extra copy of the input explain all of the speed
> difference (is this what np.sum does internally?):
>
>    >> timeit i.astype(numpy.int64)
>    10 loops, best of 3: 29.2 ms per loop
>
> No.
>

I think you can see the overhead here:

In [14]: timeit einsum('ijk->ij', i, dtype=int32)
100 loops, best of 3: 17.6 ms per loop

In [15]: timeit einsum('ijk->ij', i, dtype=int64)
100 loops, best of 3: 18 ms per loop

In [16]: timeit einsum('ijk->ij', i, dtype=int16)
100 loops, best of 3: 18.3 ms per loop

In [17]: timeit einsum('ijk->ij', i, dtype=int8)
100 loops, best of 3: 5.87 ms per loop

> Initializing the output also adds some time:
>
>    >> timeit np.empty((1024,1024,4), dtype=np.int32)
>    100000 loops, best of 3: 2.67 us per loop
>    >> timeit np.empty((1024,1024,4), dtype=np.int64)
>    100000 loops, best of 3: 12.8 us per loop
>
> Switching back and forth between the input and output array takes more
> "memory" time too with int64 arrays compared to int32.
>

Chuck
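
For anyone who wants to rerun the comparison outside IPython, here is a minimal sketch using the standard timeit module. The helper names explicit_sum and compare are made up for illustration. Passing dtype= to sum() to keep the accumulator at the input width is only an assumption about how to separate the upcast from the reduction itself; it was not tried in this thread. Timings will differ across NumPy versions and machines, so no expected numbers are given.

```python
import timeit

import numpy as np


def explicit_sum(a):
    """Add the four slices of the last axis; the result keeps a's dtype."""
    return a[..., 0] + a[..., 1] + a[..., 2] + a[..., 3]


def compare(dtype, shape=(1024, 1024, 4), repeat=3, number=5):
    """Return result dtypes and best per-call times (seconds) for one dtype."""
    a = np.ones(shape, dtype)
    candidates = {
        # Default reduction: ints narrower than the platform int are upcast.
        "sum": lambda: a.sum(axis=-1),
        # Assumption: pinning dtype= keeps the accumulator at the input width.
        "sum_same_dtype": lambda: a.sum(axis=-1, dtype=dtype),
        # Slice-by-slice addition, as in the original timings.
        "explicit": lambda: explicit_sum(a),
    }
    dtypes = {name: fn().dtype for name, fn in candidates.items()}
    times = {
        name: min(timeit.repeat(fn, repeat=repeat, number=number)) / number
        for name, fn in candidates.items()
    }
    return dtypes, times


if __name__ == "__main__":
    for dt in (np.int8, np.int16, np.int32, np.int64):
        dtypes, times = compare(dt)
        for name in times:
            print(f"{np.dtype(dt).name:>5} {name:<15} "
                  f"{str(dtypes[name]):<6} {times[name] * 1e3:8.2f} ms")
```

Note that with dtype pinned to a narrow type the sum can overflow on real data, which is presumably why sum() upcasts by default; the explicit addition has the same limitation.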
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
