Matthieu,
I got it to run in valgrind (using the options from your blog).
For dist2d() it says under "Types":

----------------------------------------------------------------------
Event Type              Incl.   Self  Short  Formula
----------------------------------------------------------------------
Instruction Fetch      379090     32  Ir
Data Read Access       116136      7  Dr
Data Write Access       29921     11  Dw
L1 Instn Fetch Miss       195      4  I1mr
L1 Data Read Miss         270      2  D1mr
L1 Data Write Miss       3636      0  D1mw
L2 Instn Fetch Miss       125      4  I2mr
L2 Data Read Miss         144      1  D2mr
L2 Data Write Miss         83      0  D2mw
L1 Miss Sum              4101      6  L1m    = I1mr + D1mr + D1mw
L2 Miss Sum               352      5  L2m    = I2mr + D2mr + D2mw
Cycle Estimation       455300    592  CEst   = Ir + 10 L1m + 100 L2m
----------------------------------------------------------------------

(hope this is readable - I used some OCR software to paste it here)
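As a cross-check on the OCR'd numbers (this is not valgrind output), the sums and the cycle estimate can be recomputed from the formulas in the table. The struct and function names below are made up for illustration; only the short names (Ir, I1mr, ...) and the weights 10 and 100 are taken from the table.

```c
#include <assert.h>
#include <stddef.h>

/* Event counts, named after the "Short" column of the table.
   The type and function names are hypothetical, for illustration only. */
typedef struct {
    long Ir;                 /* instruction fetches       */
    long I1mr, D1mr, D1mw;   /* L1 misses: instr/read/write */
    long I2mr, D2mr, D2mw;   /* L2 misses: instr/read/write */
} cache_events;

/* L1m = I1mr + D1mr + D1mw */
static long l1_miss_sum(const cache_events *e)
{
    assert(e != NULL);
    return e->I1mr + e->D1mr + e->D1mw;
}

/* L2m = I2mr + D2mr + D2mw */
static long l2_miss_sum(const cache_events *e)
{
    assert(e != NULL);
    return e->I2mr + e->D2mr + e->D2mw;
}

/* CEst = Ir + 10 L1m + 100 L2m */
static long cycle_estimate(const cache_events *e)
{
    assert(e != NULL);
    return e->Ir + 10 * l1_miss_sum(e) + 100 * l2_miss_sum(e);
}
```

Fed with the "Incl." column this gives L1m = 4101, L2m = 352 and CEst = 455300; with the "Self" column it gives 6, 5 and 592. So the last row is consistent with 455300 (Incl.) and 592 (Self).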
(My source code still uses __restrict__.)
Do I see this right, that the problem is shown by the lines
"L1 Data Read Miss" and "L1 Miss Sum"?
I have no idea what to look for ...

Thanks for the help,
-- Sebastian

On Tue, Feb 15, 2011 at 5:25 PM, Matthieu Brucher <matthieu.bruc...@gmail.com> wrote:
> Use restrict directly in C99 mode (__restrict does not have exactly the
> same semantics).
> For a valgrind profile, you can check my blog
> (http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/).
> Basically, if you have a Python script, you can run
> valgrind --optionsinmyblog python myscript.py
> For PAPI, you have to install several packages (the perf kernel module,
> for instance) and a GUI to analyze the results (in Eclipse, it should be
> possible).
> Matthieu
>
> 2011/2/15 Sebastian Haase <seb.ha...@gmail.com>
>>
>> Thanks Matthieu,
>> using __restrict__ with g++ did not change anything. How do I use
>> valgrind with C extensions?
>> I don't know what a "PAPI profile" is ...?
>> -Sebastian
>>
>>
>> On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher
>> <matthieu.bruc...@gmail.com> wrote:
>> > Hi,
>> > My first move would be to add a restrict keyword to dist (i.e. dist is
>> > the only pointer to the specific memory location), and then declare
>> > dist_ inside the first loop, also with restrict.
>> > Then I would run valgrind or a PAPI profile on your code to see what
>> > causes the issue (false sharing, ...).
>> > Matthieu
>> >
>> > 2011/2/15 Sebastian Haase <seb.ha...@gmail.com>
>> >>
>> >> Hi,
>> >> I assume that someone here could maybe help me, and I'm hoping it's
>> >> not too much off topic.
>> >> I have 2 arrays of 2d point coordinates and would like to calculate
>> >> all pairwise distances as fast as possible.
>> >> Going from Python/Numpy to a (Swigged) C extension already gave me a
>> >> 55x speedup
>> >> (.9ms vs. 50ms for arrays of length 329 and 340).
>> >> I'm using gcc on Linux.
>> >> Now I'm wondering if I could go even faster !?
>> >> My hope that the compiler might automagically do some SSE2
>> >> optimization was disappointed.
>> >> Since I have a 4 core CPU I thought OpenMP might be an option;
>> >> I never used that, and after some playing around I managed to get
>> >> (only) a 50% slowdown(!) :-(
>> >>
>> >> My code in short is this:
>> >> (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i)
>> >> -------<Ccode> ----------
>> >> void dists2d(
>> >>         double *a_ps, int nx1, int na,
>> >>         double *b_ps, int nx2, int nb,
>> >>         double *dist, int nx3, int ny3) throw (char*)
>> >> {
>> >>   if(nx1 != 2) throw (char*) "a must be of shape (n,2)";
>> >>   if(nx2 != 2) throw (char*) "b must be of shape (n,2)";
>> >>   if(nx3 != nb || ny3 != na) throw (char*) "c must be of shape (na,nb)";
>> >>
>> >>   double *dist_;
>> >>   int i, j;
>> >>
>> >> #pragma omp parallel private(dist_, j, i)
>> >>   {
>> >> #pragma omp for nowait
>> >>     for(i=0;i<na;i++)
>> >>       {
>> >>         //num_threads=omp_get_num_threads(); --> 4
>> >>         dist_ = dist+i*nb;  // dist_ is only introduced for OpenMP
>> >>         for(j=0;j<nb;j++)
>> >>           {
>> >>             *dist_++ = sqrt( sq(a_ps[i*nx1]   - b_ps[j*nx2]) +
>> >>                              sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) );
>> >>           }
>> >>       }
>> >>   }
>> >> }
>> >> -------</Ccode> ----------
>> >> There is probably a simple mistake in this code - as I said I never
>> >> used OpenMP before.
>> >> It should not be too difficult to use OpenMP correctly here
>> >> or - maybe better -
>> >> is there a simple SSE(2,3,4) version that might be even better than
>> >> OpenMP... !?
>> >>
>> >> I suppose that I did not get the #pragma omp lines right - any idea ?
>> >> Or is it in general not possible to speed this kind of code up using
>> >> OpenMP !?
>> >>
>> >> Thanks,
>> >> Sebastian Haase
>> >> _______________________________________________
>> >> NumPy-Discussion mailing list
>> >> NumPy-Discussion@scipy.org
>> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>> >
>> > --
>> > Information System Engineer, Ph.D.
>> > Blog: http://matt.eifelle.com
>> > LinkedIn: http://www.linkedin.com/in/matthieubrucher