Wes, I think I should get a couple of GPUs so I'd be ready for anything ... if you think that I could do some easy(!) CUDA programming here, maybe you could point me in the right direction... Thanks, Sebastian.
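[Editor's note: a minimal, untested sketch of what a CUDA version of dists2d() could look like is below — one thread per (i, j) pair, each computing a single distance. The kernel name, block size, and the device-buffer names (d_a, d_b, d_dist) are placeholders chosen for illustration, and the host-side cudaMalloc/cudaMemcpy plumbing and the hookup into the SWIG wrapper are left out.]

-------<CUDAcode> ----------
// Sketch only (untested); names are placeholders, not a drop-in replacement.
// One thread computes one pairwise distance dist[i*nb + j].
__global__ void dists2d_kernel(const double *a_ps, int na,
                               const double *b_ps, int nb,
                               double *dist)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // index into b_ps
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // index into a_ps
    if (i < na && j < nb) {
        double dx = a_ps[2*i]   - b_ps[2*j];
        double dy = a_ps[2*i+1] - b_ps[2*j+1];
        dist[i*nb + j] = sqrt(dx*dx + dy*dy);
    }
}

// Host side, assuming d_a, d_b, d_dist were allocated with cudaMalloc and
// the inputs copied over with cudaMemcpy:
//   dim3 block(16, 16);
//   dim3 grid((nb + block.x - 1) / block.x, (na + block.y - 1) / block.y);
//   dists2d_kernel<<<grid, block>>>(d_a, na, d_b, nb, d_dist);
-------</CUDAcode> ----------

One caveat: for arrays of only a few hundred points, the host-to-device copies will probably dominate, so the GPU route likely only pays off for much larger inputs or when the data can stay on the device between calls.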
On Tue, Feb 15, 2011 at 5:26 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> On Tue, Feb 15, 2011 at 11:25 AM, Matthieu Brucher
> <matthieu.bruc...@gmail.com> wrote:
>> Use restrict directly in C99 mode (__restrict does not have exactly the
>> same semantics).
>> For a valgrind profile, you can check my blog
>> (http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/)
>> Basically, if you have a python script, you can run valgrind --optionsinmyblog
>> python myscript.py
>> For PAPI, you have to install several packages (the perf module for the
>> kernel, for instance) and a GUI to analyze the results (in Eclipse, it
>> should be possible).
>>
>> Matthieu
>>
>> 2011/2/15 Sebastian Haase <seb.ha...@gmail.com>
>>>
>>> Thanks Matthieu,
>>> using __restrict__ with g++ did not change anything. How do I use
>>> valgrind with C extensions?
>>> I don't know what a "PAPI profile" is ...?
>>> -Sebastian
>>>
>>> On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher
>>> <matthieu.bruc...@gmail.com> wrote:
>>> > Hi,
>>> > My first move would be to add a restrict keyword to dist (i.e. dist is
>>> > the only pointer to that memory location), and then declare dist_
>>> > inside the first loop, also with a restrict.
>>> > Then I would run valgrind or a PAPI profile on your code to see what
>>> > causes the issue (false sharing, ...)
>>> > Matthieu
>>> >
>>> > 2011/2/15 Sebastian Haase <seb.ha...@gmail.com>
>>> >>
>>> >> Hi,
>>> >> I assume that someone here could maybe help me, and I'm hoping it's
>>> >> not too much off topic.
>>> >> I have 2 arrays of 2d point coordinates and would like to calculate
>>> >> all pairwise distances as fast as possible.
>>> >> Going from Python/Numpy to a (Swigged) C extension already gave me a
>>> >> 55x speedup (0.9 ms vs. 50 ms for arrays of length 329 and 340).
>>> >> I'm using gcc on Linux.
>>> >> Now I'm wondering if I could go even faster !?
>>> >> My hope that the compiler might automagically do some SSE2
>>> >> optimization was disappointed.
>>> >> Since I have a 4-core CPU I thought OpenMP might be an option;
>>> >> I had never used it, and after some playing around I managed to get
>>> >> (only) a 50% slowdown(!) :-(
>>> >>
>>> >> My code in short is this:
>>> >> (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i)
>>> >> -------<Ccode> ----------
>>> >> void dists2d(
>>> >>     double *a_ps, int nx1, int na,
>>> >>     double *b_ps, int nx2, int nb,
>>> >>     double *dist, int nx3, int ny3)  throw (char*)
>>> >> {
>>> >>   if(nx1 != 2)  throw (char*) "a must be of shape (n,2)";
>>> >>   if(nx2 != 2)  throw (char*) "b must be of shape (n,2)";
>>> >>   if(nx3 != nb || ny3 != na)  throw (char*) "c must be of shape (na,nb)";
>>> >>
>>> >>   double *dist_;
>>> >>   int i, j;
>>> >>
>>> >> #pragma omp parallel private(dist_, j, i)
>>> >>   {
>>> >> #pragma omp for nowait
>>> >>     for(i=0; i<na; i++)
>>> >>     {
>>> >>       //num_threads = omp_get_num_threads();  --> 4
>>> >>       dist_ = dist + i*nb;   // dist_ is only introduced for OpenMP
>>> >>       for(j=0; j<nb; j++)
>>> >>       {
>>> >>         *dist_++ = sqrt( sq(a_ps[i*nx1]   - b_ps[j*nx2]) +
>>> >>                          sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) );
>>> >>       }
>>> >>     }
>>> >>   }
>>> >> }
>>> >> -------</Ccode> ----------
>>> >> There is probably a simple mistake in this code - as I said, I have
>>> >> never used OpenMP before.
>>> >> It should not be too difficult to use OpenMP correctly here,
>>> >> or - maybe better -
>>> >> is there a simple SSE(2,3,4) version that might be even better than
>>> >> OpenMP... !?
>>> >>
>>> >> I suppose that I did not get the #pragma omp lines right - any idea?
>>> >> Or is it in general not possible to speed this kind of code up using
>>> >> OpenMP !?
>>> >>
>>> >> Thanks,
>>> >> Sebastian Haase
>
> As an aside, do you have a GPU-equipped machine? This function looks
> like it would be an easy CUDA target. Whether you'd want to go that route
> depends on who the users of the function are (e.g. do they also have the
> CUDA runtime).
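[Editor's note: the pragmas in the posted code look essentially legal; the 50% slowdown is more likely thread-startup overhead on such small arrays and/or pointer aliasing than a wrong directive. A minimal, untested sketch combining Matthieu's restrict suggestion with a single "parallel for" (loop indices declared inside the loops, so they are private automatically) might look like the following. It assumes the same SWIG signature as above, assumes sq() is just x*x, and needs gcc's -fopenmp flag.]

-------<Ccode> ----------
// Sketch only (untested): same interface as the dists2d() posted above.
#include <math.h>

static inline double sq(double x) { return x * x; }   // assumed helper

void dists2d(double * __restrict__ a_ps, int nx1, int na,
             double * __restrict__ b_ps, int nx2, int nb,
             double * __restrict__ dist, int nx3, int ny3)  throw (char*)
{
  if(nx1 != 2)  throw (char*) "a must be of shape (n,2)";
  if(nx2 != 2)  throw (char*) "b must be of shape (n,2)";
  if(nx3 != nb || ny3 != na)  throw (char*) "c must be of shape (na,nb)";

#pragma omp parallel for
  for(int i = 0; i < na; i++)          // i and j live inside the loops,
  {                                    // so each thread gets its own copies
    double * __restrict__ dist_ = dist + i*nb;
    for(int j = 0; j < nb; j++)
    {
      dist_[j] = sqrt( sq(a_ps[i*nx1]   - b_ps[j*nx2]) +
                       sq(a_ps[i*nx1+1] - b_ps[j*nx2+1]) );
    }
  }
}
-------</Ccode> ----------

If that is still not faster, it would be worth timing with OMP_NUM_THREADS=2 vs. 4 and checking how much of the 0.9 ms is spent outside the loop (SWIG conversion, allocation). On the SSE question: with -O3 (and possibly -ffast-math) gcc may already be able to vectorize this inner loop with SSE2, which would be simpler than hand-written intrinsics.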