Thanks Matthieu, using __restrict__ with g++ did not change anything. How do I use valgrind with C extensions? I don't know what "PAPI profil" is ...? -Sebastian
On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher <matthieu.bruc...@gmail.com> wrote: > Hi, > My first move would be to add a restrict keyword to dist (i.e. dist is the > only pointer to the specific memory location), and then declare dist_ inside > the first loop also with a restrict. > Then, I would run valgrind or a PAPI profil on your code to see what causes > the issue (false sharing, ...) > Matthieu > > 2011/2/15 Sebastian Haase <seb.ha...@gmail.com> >> >> Hi, >> I assume that someone here could maybe help me, and I'm hoping it's >> not too much off topic. >> I have 2 arrays of 2d point coordinates and would like to calculate >> all pairwise distances as fast as possible. >> Going from Python/Numpy to a (Swigged) C extension already gave me a >> 55x speedup. >> (.9ms vs. 50ms for arrays of length 329 and 340). >> I'm using gcc on Linux. >> Now I'm wondering if I could go even faster !? >> My hope that the compiler might automagically do some SSE2 >> optimization got disappointed. >> Since I have a 4 core CPU I thought OpenMP might be an option; >> I never used that, and after some playing around I managed to get >> (only) 50% slowdown(!) :-( >> >> My code in short is this: >> (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i) >> -------<Ccode> ---------- >> void dists2d( >> double *a_ps, int nx1, int na, >> double *b_ps, int nx2, int nb, >> double *dist, int nx3, int ny3) throw (char*) >> { >> if(nx1 != 2) throw (char*) "a must be of shape (n,2)"; >> if(nx2 != 2) throw (char*) "b must be of shape (n,2)"; >> if(nx3 != nb || ny3 != na) throw (char*) "c must be of shape (na,nb)"; >> >> double *dist_; >> int i, j; >> >> #pragma omp parallel private(dist_, j, i) >> { >> #pragma omp for nowait >> for(i=0;i<na;i++) >> { >> //num_threads=omp_get_num_threads(); --> 4 >> dist_ = dist+i*nb; // dists_ is only >> introduced for OpenMP >> for(j=0;j<nb;j++) >> { >> *dist_++ = sqrt( sq(a_ps[i*nx1] - b_ps[j*nx2]) + >> sq(a_ps[i*nx1+1] >> - b_ps[j*nx2+1]) ); >> } >> } >> } >> } >> -------</Ccode> ---------- >> There is probably a simple mistake in this code - as I said I never >> used OpenMP before. >> It should be not too difficult to use OpenMP correctly here >> or - maybe better - >> is there a simple SSE(2,3,4) version that might be even better than >> OpenMP... !? >> >> I supposed, that I did not get the #pragma omp lines right - any idea ? >> Or is it in general not possible to speed this kind of code up using >> OpenMP !? >> >> Thanks, >> Sebastian Haase >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion@scipy.org >> http://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > -- > Information System Engineer, Ph.D. > Blog: http://matt.eifelle.com > LinkedIn: http://www.linkedin.com/in/matthieubrucher > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > http://mail.scipy.org/mailman/listinfo/numpy-discussion > > _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion