Hi all,

This is my first email to the list, so I just want to say how grateful I am to have Python + NumPy + IPython etc. for my day-to-day needs. It's a great combination of software.
Anyway, I've been having a bottleneck in one of my algorithms that has been bugging me for quite a while, and the objective is to speed this part up. I've already done a lot of optimization and parallel processing around that piece of code to get a decent run time.

The problem is easy to state: accumulate into a matrix a weighted sum of other matrices. Let's call this function scale_and_add:

    import numpy as np

    def scale_and_add_re(R, w, Ms):
        (nb_add, mdim, j) = np.shape(Ms)
        for i in range(nb_add):
            R += w[i] * Ms[i]
        return R

This 'for' loop bugs me since I know it slows things down, but the dimensions are so large that every attempt I've made to vectorize it ends up slower and takes too much memory. I typically work with 1000 weights and matrices, with matrices several hundred on a side. With my current config:

    In [2]: %timeit scale_and_add_re(R, w, Ms)
    1 loops, best of 3: 392 ms per loop

    In [3]: w.shape
    Out[3]: (2000,)

    In [4]: Ms.shape
    Out[4]: (2000, 200, 200)

I'd like to be able to double these dimensions. For instance, I can vectorize the whole thing with a dot product:

    %timeit dot(Ms.T, w)
    1 loops, best of 3: 1.77 s per loop

But this is (i) slower and (ii) takes too much memory. (By the way, I'd really need an in-place dot product in NumPy to avoid the copy in memory, like the BLAS calls in scipy.linalg. But that's for another thread...)

The matrices are square and symmetric. I should be able to get something out of this, but I never found anything related to it in NumPy.

I also tried a Cython reimplementation:

    %timeit scale_and_add_reg(L1, w, Ms)
    1 loops, best of 3: 393 ms per loop

It brought nothing in speed. Here's the code:

    import numpy as np
    cimport numpy as np
    cimport cython

    @cython.boundscheck(False)
    def scale_and_add_reg(np.ndarray[float, ndim=2] R,
                          np.ndarray[float, ndim=1] w,
                          np.ndarray[float, ndim=3] Ms):
        return _scale_and_add_reg(R, w, Ms)

    @cython.boundscheck(False)
    cdef int _scale_and_add_reg(np.ndarray[float, ndim=2] R,
                                np.ndarray[float, ndim=1] w,
                                np.ndarray[float, ndim=3] Ms):
        cdef unsigned int mdim
        cdef int nb_add
        (nb_add, mdim, j) = np.shape(Ms)
        cdef unsigned int i
        for i from 0 <= i < nb_add:
            # The body is still a whole-array NumPy expression, so each
            # iteration goes through the NumPy machinery anyway:
            R += w[i] * Ms[i]
            #for j in range(mdim):
            #    for k in range(mdim):
            #        R[j, k] += w[i] * Ms[i, j, k]
        return 0

So here's my question, if someone has time to answer it: did I try everything possible? Should I move on and live with this bottleneck? Or is there something else I didn't think of? (A few half-formed ideas I've been meaning to test are sketched in the P.S. below.)

Thanks for reading, keep up the great work!

-n
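P.S. For the record, here are some variants I've sketched but not carefully benchmarked at full size. First, the whole contraction can be written as a single einsum or tensordot call (assuming import numpy as np and the shapes above); I don't know yet whether either beats the plain loop at these sizes, since both still build one (mdim, mdim) temporary before the in-place add:

    def scale_and_add_einsum(R, w, Ms):
        # One contraction call instead of a Python-level loop; still
        # allocates a single temporary the size of R.
        R += np.einsum('i,ijk->jk', w, Ms)
        return R

    def scale_and_add_tensordot(R, w, Ms):
        # Same contraction, summing over axis 0 of both operands.
        R += np.tensordot(w, Ms, axes=(0, 0))
        return R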
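Second, I suspect dot(Ms.T, w) is slow mostly because the transpose makes the BLAS call walk memory with a huge stride. Reshaping Ms to 2-D first keeps everything contiguous, so the weighted sum becomes one matrix-vector product (again just a sketch, not timed at full size):

    def scale_and_add_gemv(R, w, Ms):
        # View the (nb_add, mdim, mdim) stack as (nb_add, mdim*mdim);
        # the reshape is a no-copy view for a C-contiguous Ms.
        nb_add = Ms.shape[0]
        R += np.dot(w, Ms.reshape(nb_add, -1)).reshape(R.shape)
        return R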
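Third, on symmetry: I haven't found a NumPy primitive for it either, but in principle one can accumulate only the upper triangle and mirror it afterwards, halving the arithmetic. A rough sketch, only valid if R and every Ms[i] are symmetric, and the fancy-indexing copy of the triangles may well eat the savings:

    def scale_and_add_sym(R, w, Ms):
        mdim = R.shape[0]
        iu = np.triu_indices(mdim)
        # Weighted sum over the upper triangles only (this fancy
        # indexing copies half of Ms, so memory is the catch).
        R[iu] += np.dot(w, Ms[:, iu[0], iu[1]])
        # Mirror the updated upper triangle into the lower one.
        R[(iu[1], iu[0])] = R[iu]
        return R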
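Last, on the in-place dot product aside: scipy.linalg.blas already wraps GEMV with beta/y arguments, so the reshaped formulation above could accumulate straight into R without the extra temporary. A sketch, assuming float64 data (use sgemv for float32) and C-contiguous arrays:

    from scipy.linalg import blas

    def scale_and_add_blas(R, w, Ms):
        nb_add = Ms.shape[0]
        # A C-contiguous (nb_add, mdim*mdim) view, once transposed, is
        # Fortran-contiguous, which the BLAS wrapper takes without a copy.
        A = Ms.reshape(nb_add, -1).T
        # out <- 1.0 * A @ w + 1.0 * y, written into R's memory in place
        # when no internal copy of y is needed.
        out = blas.dgemv(1.0, A, w, beta=1.0, y=R.reshape(-1),
                         overwrite_y=1)
        return out.reshape(R.shape)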