It may also be the size of the chunks OpenMP uses. You can/should specify it in the OMP pragma so that it is a multiple of the cache line size, or something close to it.
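For instance, a sketch of what that could look like (the chunk size of 8 and the name dists2d_chunked are only illustrative, not from the code below; the right chunk value depends on the cache line size and on nb):

------------------------------------------------------------------
#include <math.h>
#include <omp.h>

/* Sketch only: the same loop as in the_lib.c, but with an explicit
 * chunk size on the schedule clause.  CHUNK = 8 is a placeholder;
 * pick it so that each thread's block of writes to dist covers whole
 * cache lines.  num_threads is assumed to be > 0 here.             */
#define CHUNK 8

void dists2d_chunked(double *a_ps, int na, double *b_ps, int nb,
                     double *dist, int num_threads)
{
    int i, j;
    double ax, ay, dif_x, dif_y;

    omp_set_dynamic(0);                /* keep the team size fixed */
    omp_set_num_threads(num_threads);

    #pragma omp parallel for schedule(static, CHUNK) \
            private(j, ax, ay, dif_x, dif_y)
    for (i = 0; i < na; i++)
    {
        ax = a_ps[2*i];
        ay = a_ps[2*i+1];
        for (j = 0; j < nb; j++)
        {
            dif_x = ax - b_ps[2*j];
            dif_y = ay - b_ps[2*j+1];
            dist[i*nb+j] = sqrt(dif_x*dif_x + dif_y*dif_y);
        }
    }
}
------------------------------------------------------------------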
Matthieu

2011/2/17 Sebastian Haase <seb.ha...@gmail.com>
> Hi,
> More surprises:
> shaase@iris:~/code/SwiggedDistOMP: gcc -O3 -c the_lib.c -fPIC -fopenmp -ffast-math
> shaase@iris:~/code/SwiggedDistOMP: gcc -shared -o the_lib.so the_lib.o -lgomp -lm
> shaase@iris:~/code/SwiggedDistOMP: priithon the_python_prog.py
> c_threads 0   time 0.000437839031219   # this is now, without #pragma omp parallel for ...
> c_threads 1   time 0.000865449905396
> c_threads 2   time 0.000520548820496
> c_threads 3   time 0.00033704996109
> c_threads 4   time 0.000620169639587
> c_threads 5   time 0.000465350151062
> c_threads 6   time 0.000696349143982
>
> This corrects the earlier timing: max OpenMP speed (3 threads) vs. no OpenMP
> is now a speedup of (only!) 1.3x,
> not 2.33x (which was the number I got when comparing OpenMP to the cdist function).
> The C code is now:
>
> the_lib.c
> ------------------------------------------------------------------------------------------
> #include <stdio.h>
> #include <time.h>
> #include <omp.h>
> #include <math.h>
>
> void dists2d( double *a_ps, int na,
>               double *b_ps, int nb,
>               double *dist, int num_threads)
> {
>     int i, j;
>     double ax, ay, dif_x, dif_y;
>     int nx1 = 2;
>     int nx2 = 2;
>
>     if (num_threads > 0)
>     {
>         int dynamic = 0;
>         omp_set_dynamic(dynamic);
>         omp_set_num_threads(num_threads);
>
> #pragma omp parallel for private(j, i, ax, ay, dif_x, dif_y)
>         for (i = 0; i < na; i++)
>         {
>             ax = a_ps[i*nx1];
>             ay = a_ps[i*nx1+1];
>             for (j = 0; j < nb; j++)
>             {
>                 dif_x = ax - b_ps[j*nx2];
>                 dif_y = ay - b_ps[j*nx2+1];
>                 dist[i*nb+j] = sqrt(dif_x*dif_x + dif_y*dif_y);
>             }
>         }
>     } else {
>         for (i = 0; i < na; i++)
>         {
>             ax = a_ps[i*nx1];
>             ay = a_ps[i*nx1+1];
>             for (j = 0; j < nb; j++)
>             {
>                 dif_x = ax - b_ps[j*nx2];
>                 dif_y = ay - b_ps[j*nx2+1];
>                 dist[i*nb+j] = sqrt(dif_x*dif_x + dif_y*dif_y);
>             }
>         }
>     }
> }
> ------------------------------------------------------------------
> $ gcc -O3 -c the_lib.c -fPIC -fopenmp -ffast-math
> $ gcc -shared -o the_lib.so the_lib.o -lgomp -lm
>
> So, I guess I found a way of getting rid of the OpenMP overhead when
> run with 1 thread, and found that - if measured correctly, using the
> same compiler settings and so on - the speedup is so small that there
> is no point in doing OpenMP - again.
> (For my case, having (only) 4 cores.)
>
> Cheers,
> Sebastian.
>
>
> On Thu, Feb 17, 2011 at 10:57 AM, Matthieu Brucher
> <matthieu.bruc...@gmail.com> wrote:
> >
> >> Then, where does the overhead come from?
> >> The call to omp_set_dynamic(dynamic);
> >> Or the
> >> #pragma omp parallel for private(j, i, ax, ay, dif_x, dif_y)
> >
> > It may be this. You initialize a thread pool, even if it has only one
> > thread, and there is the dynamic part, so OpenMP may create several
> > chunks instead of one big chunk.
> >
> > Matthieu

--
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher
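For completeness: the_python_prog.py itself is not shown in this thread (the directory name suggests the library is called through SWIG wrappers), so the following is only a hypothetical sketch of a standalone C harness that could time dists2d directly, without any Python overhead; the problem sizes are assumptions.

------------------------------------------------------------------
/* bench.c -- hypothetical timing harness, not from the thread.
 * Build: gcc -O3 -fopenmp bench.c the_lib.o -o bench -lm          */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* prototype of the function defined in the_lib.c */
void dists2d(double *a_ps, int na, double *b_ps, int nb,
             double *dist, int num_threads);

int main(void)
{
    const int na = 1000, nb = 1000;        /* assumed problem size */
    double *a = malloc(2 * na * sizeof *a);
    double *b = malloc(2 * nb * sizeof *b);
    double *d = malloc((size_t)na * nb * sizeof *d);
    int i, t;

    /* fill the point arrays with arbitrary coordinates */
    for (i = 0; i < 2 * na; i++) a[i] = (double)rand() / RAND_MAX;
    for (i = 0; i < 2 * nb; i++) b[i] = (double)rand() / RAND_MAX;

    /* t == 0 exercises the serial branch, t >= 1 the OpenMP branch */
    for (t = 0; t <= 6; t++) {
        double t0 = omp_get_wtime();
        dists2d(a, na, b, nb, d, t);
        printf("c_threads %d  time %g\n", t, omp_get_wtime() - t0);
    }

    free(a); free(b); free(d);
    return 0;
}
------------------------------------------------------------------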