On 20 December 2011 18:57, Dirk Rothe <thec...@gmail.com> wrote:
> Hello Cython-Devs,
>
> I'v thought I check out the memoryview syntax from cython-trunk to
> refactor some tight loops on numpy arrays into smaller functions. But
> either I'm doing something wrong or the call-overhead (of dostuff() )
> is still very large. Am I missing something?
>
> @cython.boundscheck(False)
> cdef inline int dostuff(np.int_t[:] data, int i, int j) nogil:
>     return data[j] + i + j
>
> @cython.boundscheck(False)
> def test():
>     cdef np.int_t[:, :] data = np.zeros((3000, 20000), dtype=np.int)
>     cdef int i, j
>     with nogil:
>         for i in range(3000):
>             for j in range(20000):
>                 # try to be as fast
>                 data[i, j] = dostuff(data[i], i, j)
>                 # as direct array access
>                 #~ data[i, j] = data[i, j] + i + j
>
> thnx, dirk

The performance difference is indeed quite large. There are several
problems with the implementation of slices:

1) The overhead of PyThread_acquire_lock() is quite large; we should
   switch to an atomic approach.
2) Slices support up to 32 dimensions by default (configurable as a
   compiler option). That is a lot of memory to copy around all the
   time. I think a default of 8 would be more sensible, and the compiler
   option should be documented well (who uses 32 dimensions anyway?).
3) The slice function takes a generic approach and could be somewhat
   faster when the slice is direct and strided.

Addressing these problems by tweaking the generated code brings the run
time down from ~16 seconds to ~2.4 seconds. The direct indexing approach
without a function call takes ~0.35 seconds. Slicing will never be as
fast, so if you really want to write that code, you should move the
data[i] slicing to the outer loop, as in:

    for i in range(3000):
        dataslice = data[i]
        for j in range(...):
            ...

Cython could do that optimization itself, since the 'data' slice does
not change in the inner loop, but it currently doesn't. At the very
least it should not be more than 10 times slower (so this will be
worked on).

@cython-dev How should atomic operations be supported? Should this use
something like
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html , or
something like libatomic? Or should we "just" implement a garbage
collector for pure-Cython level stuff (like memoryview slices), thereby
avoiding the need for acquisition counting?

_______________________________________________
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel
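
For illustration only, here is a minimal sketch of what lock-free acquisition counting might look like with the GCC __sync builtins from the page linked above. It assumes a compiler that provides those builtins (GCC, Clang). The struct and the demo_acquire/demo_release names are hypothetical and are not Cython's generated code.

```c
#include <assert.h>

/* Hypothetical stand-in for a memoryview object; NOT Cython's real struct. */
typedef struct {
    int acquisition_count;  /* number of live slices referring to this view */
    int *data;              /* placeholder for the underlying buffer */
} demo_memview;

/* Atomically increment the count without taking a lock.
 * Returns the count as it was before the increment. */
static int demo_acquire(demo_memview *mv)
{
    return __sync_fetch_and_add(&mv->acquisition_count, 1);
}

/* Atomically decrement the count.
 * Returns nonzero when the last acquisition has been released, which is
 * the point at which the owner could free the underlying buffer. */
static int demo_release(demo_memview *mv)
{
    int remaining = __sync_sub_and_fetch(&mv->acquisition_count, 1);
    assert(remaining >= 0);  /* releasing more often than acquiring is a bug */
    return remaining == 0;
}
```

Each slice copy would call demo_acquire() and each slice going out of scope would call demo_release(), so the per-slice cost becomes a single atomic instruction instead of a lock acquire/release pair. Whether this is portable enough for all compilers Cython targets is exactly the open question above.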