Update: I just noticed that Eric's OpenMP code gives me only a 1.35x speedup when comparing 3 threads against my non-OpenMP code. Comparing 3 threads against 1 OpenMP thread, I could call it a 2.55x speedup; that sounds much better, but it is not the number that matters... (By the same token, Eric's 11x speedup on 12 cores might be more like 5x or 6x when compared against plain C without any OpenMP overhead.)
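For reference, both factors follow directly from the timings quoted further down (0.84 ms for my plain C version, roughly 1.56 ms for the OpenMP code with 1 thread, roughly 0.62 ms with 3 threads); a quick check with the rounded numbers:

    t_plain_c = 0.84e-3   # my non-OpenMP C code
    t_omp_1   = 1.557e-3  # OpenMP, 1 thread (carries the OpenMP overhead)
    t_omp_3   = 0.617e-3  # OpenMP, 3 threads
    print t_plain_c / t_omp_3   # ~1.36 -- the speedup that matters
    print t_omp_1 / t_omp_3     # ~2.52 -- flattering, but measured against an inflated baseline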
Finally, I have to report that when plugging the C code (from scipy.spatial.distance.cdist) into my real application -- which, for one sample file, compares point pairs across a sequence of 1000 microscopy images -- the overall speed gain appears to be only about 2x(!) compared to my original app, which used this simple numpy version:

    import numpy as N

    def pyDist2(a_ps, b_ps, ab_dist):
        # allocate the (na, nb) output and fill it one row at a time,
        # letting numpy broadcasting handle the inner loop over b_ps
        ab_dist = N.zeros(shape=(a_ps.shape[0], b_ps.shape[0]), dtype=N.float64)
        for i in range(a_ps.shape[0]):
            d = b_ps - a_ps[i]
            ab_dist[i] = N.sqrt(N.sum(d**2, -1))
        return ab_dist

(A fully vectorized variant of pyDist2 is sketched in the P.S. at the very end of this message, below the quoted code.)

(The test I was using before compared just one pair of arrays, of lengths 329 and 340, whereas the average length over the 1000 images is 260. It could also be that I have substantial constant overhead from some OpenGL lists that I'm generating in my app.)

Nevertheless, I'm happy to have learned something about OpenMP. I'm still trying to get more information out of valgrind. Also, maybe someone could comment on whether SSE would help for the function in question.

Thanks,
Sebastian

On Wed, Feb 16, 2011 at 1:50 PM, Sebastian Haase <seb.ha...@gmail.com> wrote:
> Eric,
> this is amazing!! Thanks very much -- I have rarely seen such a compact
> source example that just worked.
> The timings I get are:
> c_threads 1 time 0.00155731916428
> c_threads 2 time 0.000829789638519
> c_threads 3 time 0.000616888999939
> c_threads 4 time 0.000704760551453
> c_threads 5 time 0.000933599472046
> c_threads 6 time 0.000809240341187
> c_threads 7 time 0.000837240219116
> c_threads 8 time 0.000817658901215
> c_threads 9 time 0.000843930244446
> c_threads 10 time 0.000861320495605
> c_threads 11 time 0.000936930179596
> c_threads 12 time 0.000847370624542
>
> The optimum for my
> Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
> seems to be at 3 threads, with .6 ms for the given test case.
>
> I just reran my normal non-OpenMP C based code, and it takes
> .84 ms (1.35x slower).
> from scipy.spatial import distance
> distance.cdist(a, b) takes
> .9 ms -- which includes the allocation of the output array, because
> there is no `out` option available.
>
> So I'm happy that OpenMP works,
> but apparently on my CPU the speed increase is not overwhelming (yet)...
>
> Thanks,
> -- Sebastian
>
> On Wed, Feb 16, 2011 at 4:50 AM, Eric Carlson <ecarl...@eng.ua.edu> wrote:
>> I don't have the slightest idea what I'm doing, but....
>>
>> ____
>> file name - the_lib.c
>> ____
>> #include <stdio.h>
>> #include <time.h>
>> #include <omp.h>
>> #include <math.h>
>>
>> void dists2d(double *a_ps, int na,
>>              double *b_ps, int nb,
>>              double *dist, int num_threads)
>> {
>>     int i, j;
>>     int dynamic = 0;
>>     omp_set_dynamic(dynamic);
>>     omp_set_num_threads(num_threads);
>>     double ax, ay, dif_x, dif_y;
>>     int nx1 = 2;   /* row stride of a_ps (x,y pairs) */
>>     int nx2 = 2;   /* row stride of b_ps (x,y pairs) */
>>
>> #pragma omp parallel for private(j, i, ax, ay, dif_x, dif_y)
>>     for (i = 0; i < na; i++)
>>     {
>>         ax = a_ps[i*nx1];
>>         ay = a_ps[i*nx1 + 1];
>>         for (j = 0; j < nb; j++)
>>         {
>>             dif_x = ax - b_ps[j*nx2];
>>             dif_y = ay - b_ps[j*nx2 + 1];
>>             dist[i*nb + j] = sqrt(dif_x*dif_x + dif_y*dif_y);  /* row-major na x nb output */
>>         }
>>     }
>> }
>>
>> ________
>>
>> COMPILE:
>> __________
>> gcc -c the_lib.c -fPIC -fopenmp -ffast-math
>> gcc -shared -o the_lib.so the_lib.o -lgomp -lm
>>
>> ____
>> the_python_prog.py
>> _____________
>>
>> from ctypes import *
>> my_lib = CDLL('the_lib.so')  # or full path to lib
>> import numpy as np
>> import time
>>
>> na = 329
>> nb = 340
>> a = np.random.rand(na, 2)
>> b = np.random.rand(nb, 2)
>> c = np.zeros(na*nb)          # flat na x nb output buffer
>> trials = 100
>> max_threads = 24
>> for k in range(1, max_threads):
>>     n_threads = c_int(k)
>>     na2 = c_int(na)
>>     nb2 = c_int(nb)
>>
>>     start = time.time()
>>     for k1 in range(trials):
>>         ret = my_lib.dists2d(a.ctypes.data_as(c_void_p), na2,
>>                              b.ctypes.data_as(c_void_p), nb2,
>>                              c.ctypes.data_as(c_void_p), n_threads)
>>     print "c_threads", k, " time ", (time.time()-start)/trials
>>
>> ____
>> Results on my machine, dual Xeon, 12 cores
>> na=329
>> nb=340
>> ____
>> 100 trials each:
>> c_threads 1 time 0.00109949827194
>> c_threads 2 time 0.0005726313591
>> c_threads 3 time 0.000429179668427
>> c_threads 4 time 0.000349278450012
>> c_threads 5 time 0.000287139415741
>> c_threads 6 time 0.000252468585968
>> c_threads 7 time 0.000222821235657
>> c_threads 8 time 0.000206289291382
>> c_threads 9 time 0.000187981128693
>> c_threads 10 time 0.000172770023346
>> c_threads 11 time 0.000164999961853
>> c_threads 12 time 0.000157740116119
>>
>> ____
>> Results on my machine, dual Xeon, 12 cores
>> na=3290
>> nb=3400
>> ______
>> 100 trials each:
>> c_threads 1 time 0.10744508028
>> c_threads 2 time 0.0542239999771
>> c_threads 3 time 0.037127559185
>> c_threads 4 time 0.0280736112595
>> c_threads 5 time 0.0228648614883
>> c_threads 6 time 0.0194904088974
>> c_threads 7 time 0.0165715909004
>> c_threads 8 time 0.0145838689804
>> c_threads 9 time 0.0130002498627
>> c_threads 10 time 0.0116940999031
>> c_threads 11 time 0.0107557415962
>> c_threads 12 time 0.00990005016327 (speedup almost 11)
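P.S. Regarding my own question about SSE: no answer yet, but for completeness here is a fully vectorized (loop-free) numpy variant of the pyDist2 above, relying purely on broadcasting. This is just a sketch I have not benchmarked (the name pyDist2_vec is made up); it builds an (na, nb, 2) temporary, so it trades memory for avoiding the Python-level loop:

    import numpy as N

    def pyDist2_vec(a_ps, b_ps):
        # (na,1,2) - (1,nb,2) broadcasts to (na,nb,2); sum the squared
        # coordinate differences over the last axis and take the sqrt
        d = a_ps[:, N.newaxis, :] - b_ps[N.newaxis, :, :]
        return N.sqrt(N.sum(d**2, -1))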