Wed, 20 Jul 2011 09:04:09 +0000, Pauli Virtanen wrote:
> Wed, 20 Jul 2011 08:49:21 +0200, Carlos Becker wrote:
>> Those are very interesting examples. I think that pre-allocation is
>> very important, and something similar happens in Matlab if no
>> pre-allocation is done: it takes 3-4x longer than with pre-allocation.
>> The main difference is that Matlab is able to take a pre-allocated
>> array/matrix into account, probably avoiding the creation of a
>> temporary and writing the results directly into the pre-allocated
>> array.
>
> You have not demonstrated that the difference you see comes from
> pre-allocation.
>
> If it came from pre-allocation, how come I get the same speed as an
> equivalent C implementation, which *does* pre-allocate, using exactly
> the same benchmark codes as you have posted?
Ok, I looked at it on a different machine, and in the end I'll have to
agree with you :)

Some interesting data points: on my Eee laptop (with the test codes
below), I get

    Numpy:                0.0771
    Numpy (preallocated): 0.0383

On a different machine, on the other hand:

    Numpy:                0.0288
    Numpy (preallocated): 0.0283

For larger array sizes the situation starts to change (4000x4000):

    Numpy (allocation & zeroing): 0.161
    Numpy:                        0.1953
    Numpy (preallocated):         0.1153

Also interestingly,

    Zeroing (memset, per element): 1.04313e-08   # 4000x4000 array
    Zeroing (memset, per element): 1.05333e-08   # 3000x3000 array
    Zeroing (memset, per element): 1.04427e-08   # 2048x2048 array
    Zeroing (memset, per element): 2.24223e-09   # 2048x2047 array
    Zeroing (memset, per element): 2.1e-09       # 2000x2000 array
    Zeroing (memset, per element): 1.75e-09      # 200x200 array

    Zeroing (memset, preallocated, per element): 2.06667e-09  # 3000x3000
    Zeroing (memset, preallocated, per element): 2.0504e-09   # 2048x2048
    Zeroing (memset, preallocated, per element): 1.94e-09     # 200x200

There is a sharp order-of-magnitude change in the speed of malloc+memset
of an array, which is not present in memset alone. (This is then also
reflected in the Numpy performance -- the floating point operations
probably cost little compared to memory access.) It seems that either
the kernel or the C library changes the way it handles allocation at
that point -- likely the C library stops recycling blocks of that size
and serves them directly via mmap, so their pages have to be faulted in
and zeroed by the kernel on every allocation. (The break falls between
2048x2047 and 2048x2048 doubles; the latter is exactly 32 MiB, which
happens to match glibc malloc's maximum mmap threshold.)

So yes, it seems that you were right after all: for large arrays
pre-allocation can be an issue, but the exact size limit depends on the
machine etc. in question, which is why I didn't manage to see this at
first.

***

In this particular case, we are somewhat limited by Python in what
optimizations we can do. It is not possible to have the expression
`k = m - 0.5` be translated into `np.subtract(m, 0.5, k)` instead of
`k = np.subtract(m, 0.5)`, because this translation is done by Python
itself.
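For reference, the rewritten form relies on the optional output argument
that numpy ufuncs accept; a minimal sketch of the difference:

```python
import numpy as np

m = np.ones((2000, 2000))
k = np.empty_like(m)        # preallocated output buffer

k2 = m - 0.5                # allocates a fresh result array each time
np.subtract(m, 0.5, out=k)  # writes directly into k, no new allocation

assert np.array_equal(k, k2)
```

In a loop, the second form touches the same buffer on every iteration,
which is exactly what the "preallocated" benchmark below measures.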
If the translation is taken away from Python, e.g. by switching to lazy
evaluation or by going via Numexpr, then things can be improved. There
have been some ideas floating around about implementing matrix algebra
lazily in this way, with the ability to reuse temporary buffers. The
issue of reusing temporaries in expressions such as `a = 0.3*x + y + z`
should, however, be possible to address within Numpy itself.

    Pauli

----------------------------------------------------
#include <stdlib.h>
#include <time.h>
#include <stdio.h>
#include <string.h>

int main()
{
    double *a, *b;
    int N = 2000*2000, M = 100;
    int j;
    int k;
    clock_t start, end;

    a = (double*)malloc(sizeof(double)*N);
    b = (double*)malloc(sizeof(double)*N);
    memset(a, '\0', sizeof(double)*N);  /* initialize a: reading it
                                           uninitialized below would be
                                           undefined behavior */

    start = clock();
    for (k = 0; k < M; ++k) {
        memset(b, '\0', sizeof(double)*N);
    }
    end = clock();
    printf("Zeroing (memset, preallocated, per element): %g\n",
           ((double)(end-start))/CLOCKS_PER_SEC/M/N);
    free(b);

    start = clock();
    for (k = 0; k < M; ++k) {
        b = (double*)malloc(sizeof(double)*N);
        memset(b, '\0', sizeof(double)*N);
        free(b);
    }
    end = clock();
    printf("Zeroing (memset, per element): %g\n",
           ((double)(end-start))/CLOCKS_PER_SEC/M/N);

    b = (double*)malloc(sizeof(double)*N);
    start = clock();
    for (k = 0; k < M; ++k) {
        for (j = 0; j < N; ++j) {
            b[j] = a[j] - 0.5;
        }
    }
    end = clock();
    printf("Operation in C (preallocated): %g\n",
           ((double)(end-start))/CLOCKS_PER_SEC/M);
    free(b);

    start = clock();
    for (k = 0; k < M; ++k) {
        b = (double*)malloc(sizeof(double)*N);
        for (j = 0; j < N; ++j) {
            b[j] = a[j] - 0.5;
        }
        free(b);
    }
    end = clock();
    printf("Operation in C: %g\n",
           ((double)(end-start))/CLOCKS_PER_SEC/M);

    free(a);
    return 0;
}
----------------------------------------------------
import time
import numpy as np

print np.__version__, np.__file__

m = np.ones([2000, 2000], float)
N = 100

print (m.size * m.dtype.itemsize) / 1e6

t1 = time.clock()
for x in range(N):
    k = np.zeros_like(m)
t2 = time.clock()
print "Numpy (allocation & zeroing):", (t2 - t1) / N

t1 = time.clock()
for x in range(N):
    k = m - 0.5
t2 = time.clock()
print "Numpy:", (t2 - t1) / N

t1 = time.clock()
for x in range(N):
    np.subtract(m, 0.5, k)
t2 = time.clock()
print "Numpy (preallocated):", (t2 - t1) / N

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
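As a footnote to the temporaries discussion above: for an expression
like `a = 0.3*x + y + z` the buffer reuse can already be approximated by
hand today with in-place ufunc operations -- a sketch of the idea, not
something numpy does automatically:

```python
import numpy as np

x = np.random.rand(1000)
y = np.random.rand(1000)
z = np.random.rand(1000)

# Naive form: builds two temporaries (0.3*x, then 0.3*x + y)
a_naive = 0.3*x + y + z

# Hand-optimized form: one buffer, reused for every step
a = np.multiply(x, 0.3)   # a = 0.3*x (the only allocation)
a += y                    # in-place add, no temporary
a += z                    # in-place add, no temporary

assert np.allclose(a, a_naive)
```

A lazy-evaluation layer (or Numexpr) could do this rewriting
automatically instead of requiring it to be spelled out.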