On Tue, Jun 21, 2011 at 9:46 AM, Zachary Pincus <[email protected]> wrote:
> Hello all,
>
> As a result of the "fast greyscale conversion" thread, I noticed an anomaly
> with numpy.ndarray.sum(): summing along certain axes is much slower with
> sum() than doing it explicitly, but only with integer dtypes and only when
> the size of the dtype is less than the machine word. I checked in 32-bit and
> 64-bit modes, and in both cases the speed difference went away only once the
> dtype got as large as the machine word. See below...
>
> Is this something to do with numpy or something inexorable about machine /
> memory architecture?
>
> Zach
>
> Timings -- 64-bit mode:
> ----------------------
> In [2]: i = numpy.ones((1024,1024,4), numpy.int8)
> In [3]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 131 ms per loop
> In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 2.57 ms per loop
>
> In [5]: i = numpy.ones((1024,1024,4), numpy.int16)
> In [6]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 131 ms per loop
> In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 4.75 ms per loop
>
> In [8]: i = numpy.ones((1024,1024,4), numpy.int32)
> In [9]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 131 ms per loop
> In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 6.37 ms per loop
>
> In [11]: i = numpy.ones((1024,1024,4), numpy.int64)
> In [12]: timeit i.sum(axis=-1)
> 100 loops, best of 3: 16.6 ms per loop
> In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 15.1 ms per loop
>
>
>
> Timings -- 32-bit mode:
> ----------------------
> In [2]: i = numpy.ones((1024,1024,4), numpy.int8)
> In [3]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 138 ms per loop
> In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 3.68 ms per loop
>
> In [5]: i = numpy.ones((1024,1024,4), numpy.int16)
> In [6]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 140 ms per loop
> In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 4.17 ms per loop
>
> In [8]: i = numpy.ones((1024,1024,4), numpy.int32)
> In [9]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 22.4 ms per loop
> In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 12.2 ms per loop
>
> In [11]: i = numpy.ones((1024,1024,4), numpy.int64)
> In [12]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 29.2 ms per loop
> In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 10 loops, best of 3: 23.8 ms per loop
One difference is that i.sum() changes the output dtype for integer input
when the input dtype is smaller than the default int dtype:
>> i.dtype
dtype('int32')
>> i.sum(axis=-1).dtype
dtype('int64') # <-- dtype changed
>> (i[...,0]+i[...,1]+i[...,2]+i[...,3]).dtype
dtype('int32')
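If the upcast is what costs the time, one thing worth trying is the dtype argument of sum(), which sets the accumulator and output dtype. The following is only a sketch on a scaled-down int8 array (the shape and variable names are mine, not from the timings in this thread), and I have not measured whether it closes the speed gap:

```python
import numpy as np

# Scaled-down version of the test array from this thread (shape is an assumption).
i = np.ones((64, 64, 4), dtype=np.int8)

# Default behaviour: small integer input is accumulated in a wider integer type.
default_sum = i.sum(axis=-1)

# Ask sum() to keep the input dtype. Note this can overflow for small dtypes
# when the summed values are large; here each sum is only 4.
same_dtype_sum = i.sum(axis=-1, dtype=i.dtype)

# Explicit slice-by-slice addition, which never changes the dtype.
explicit_sum = i[..., 0] + i[..., 1] + i[..., 2] + i[..., 3]

print(default_sum.dtype, same_dtype_sum.dtype, explicit_sum.dtype)
```

With dtype=i.dtype the result should match the explicit addition in both value and dtype, so it is the fairer comparison against the slice version.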
Here are my timings:
>> i = numpy.ones((1024,1024,4), numpy.int32)
>> timeit i.sum(axis=-1)
1 loops, best of 3: 278 ms per loop
>> timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 12.1 ms per loop
>> import bottleneck as bn
>> timeit bn.func.nansum_3d_int32_axis2(i)
100 loops, best of 3: 8.27 ms per loop
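For anyone who wants to rerun this kind of comparison outside IPython, here is a rough harness built on the standard timeit module. The helper names (sum_default, sum_same_dtype, sum_explicit, best_ms, compare) and the smaller default shape are my own choices for illustration; the numbers will depend on the machine and the numpy version, so I make no claim about what it prints:

```python
import timeit

import numpy as np


def sum_default(a):
    # Plain sum along the last axis; small int dtypes get upcast.
    return a.sum(axis=-1)


def sum_same_dtype(a):
    # Same reduction, but accumulating in the input dtype.
    return a.sum(axis=-1, dtype=a.dtype)


def sum_explicit(a):
    # Explicit addition of the four slices along the last axis.
    return a[..., 0] + a[..., 1] + a[..., 2] + a[..., 3]


def best_ms(func, a, repeat=3, number=5):
    # Best-of-`repeat` time per call, in milliseconds.
    times = timeit.repeat(lambda: func(a), repeat=repeat, number=number)
    return min(times) / number * 1e3


def compare(shape=(256, 256, 4),
            dtypes=(np.int8, np.int16, np.int32, np.int64)):
    # Time each variant for each integer dtype; returns nested dicts.
    results = {}
    for dt in dtypes:
        a = np.ones(shape, dtype=dt)
        results[np.dtype(dt).name] = {
            f.__name__: best_ms(f, a)
            for f in (sum_default, sum_same_dtype, sum_explicit)
        }
    return results


if __name__ == "__main__":
    for name, timings in compare().items():
        row = ", ".join(f"{k}: {v:.3f} ms" for k, v in timings.items())
        print(f"{name:>6}  {row}")
```

Pass shape=(1024, 1024, 4) to compare() to match the array size used in the timings above.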
Does making an extra, upcast copy of the input explain all of the speed
difference (is that what np.sum does internally)?
>> timeit i.astype(numpy.int64)
10 loops, best of 3: 29.2 ms per loop
No. The copy takes about 29 ms, far less than the 278 ms that sum() takes.
Allocating the output also adds some time, though only microseconds:
>> timeit np.empty((1024,1024,4), dtype=np.int32)
100000 loops, best of 3: 2.67 us per loop
>> timeit np.empty((1024,1024,4), dtype=np.int64)
100000 loops, best of 3: 12.8 us per loop
Switching back and forth between the input and output arrays also takes
more memory-access time with int64 arrays than with int32.
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion