On 17.02.2011 16:31, Matthieu Brucher wrote:

It may also be the size of the chunks OMP uses. You can/should specify it in the OMP pragma so that it is a multiple of the cache line size or something close.

Matthieu
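
The chunk size goes in the schedule clause of the parallel-for pragma. A minimal sketch of where that knob sits (not from Sebastian's original code; it assumes 64-byte cache lines and 8-byte doubles, i.e. 8 doubles per cache line):

  #include <math.h>

  /* assumption: 64-byte cache lines, 8-byte doubles */
  #define CACHE_LINE 64
  #define CHUNK ((int)(CACHE_LINE / sizeof(double)))   /* 8 */

  void pairwise_dist(const double *a_ps, int na, int nx1,
                     const double *b_ps, int nb, int nx2,
                     double *dist)
  {
      int i, j;
      double ax, ay, dif_x, dif_y;

      /* static schedule with an explicit chunk: each thread is handed
         blocks of CHUNK consecutive values of i instead of an
         implementation-defined split */
      #pragma omp parallel for schedule(static, CHUNK) \
              private(j, ax, ay, dif_x, dif_y)
      for (i = 0; i < na; i++) {
          ax = a_ps[i*nx1];
          ay = a_ps[i*nx1+1];
          for (j = 0; j < nb; j++) {
              dif_x = ax - b_ps[j*nx2];
              dif_y = ay - b_ps[j*nx2+1];
              dist[i*nb+j] = sqrt(dif_x*dif_x + dif_y*dif_y);
          }
      }
  }

Whether this actually removes the sharing depends on nb and on the alignment of dist; the sketch only shows where the chunk size is specified.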

Also beware of "false sharing" among the threads. When one processor updates the array "dist" in Sebastian's code, the cache line is dirtied for the other processors:

  #pragma omp parallel for private(j, i,ax,ay, dif_x, dif_y)
  for(i=0;i<na;i++){
     ax=a_ps[i*nx1];
     ay=a_ps[i*nx1+1];
     for(j=0;j<nb;j++) {
         dif_x = ax - b_ps[j*nx2];
         dif_y = ay - b_ps[j*nx2+1];

         /* update shared memory */

          dist[i*nb+j] = sqrt(dif_x*dif_x+dif_y*dif_y);  /* index into the na-by-nb matrix */

         /* ... and poof the cache is dirty */

     }
  }

Whenever this happens, the processors must stop whatever they are doing to resynchronize their cache lines. "False sharing" can therefore work as an "invisible GIL" inside OpenMP code: the processors can appear to run in syrup, and there is excessive traffic on the memory bus.

This is also why MPI programs often scale better than OpenMP programs, despite the IPC overhead.

A good rule when working with OpenMP is to let each thread write to its own private data arrays, and to share only read-only arrays.

One can, for example, use OpenMP's "reduction" clause to achieve this: initialize the array dist with zeros and put reduction(+:dist) in the OpenMP pragma line.
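
A reduction over a whole C array needs an array-section reduction, which not every compiler supports, but the same idea is easy to spell out by hand. A hedged sketch (not from the original code): each thread fills a private, zero-initialized copy of dist, and the copies are added into the shared array afterwards, so no thread writes shared memory inside the hot loop.

  #include <math.h>
  #include <stdlib.h>
  #include <string.h>

  void pairwise_dist_reduce(const double *a_ps, int na, int nx1,
                            const double *b_ps, int nb, int nx2,
                            double *dist)
  {
      memset(dist, 0, (size_t)na * nb * sizeof(double));

      #pragma omp parallel
      {
          /* private, zero-initialized copy of the output */
          double *local = calloc((size_t)na * nb, sizeof(double));
          int i, j, k;

          #pragma omp for nowait
          for (i = 0; i < na; i++) {
              double ax = a_ps[i*nx1];
              double ay = a_ps[i*nx1+1];
              for (j = 0; j < nb; j++) {
                  double dif_x = ax - b_ps[j*nx2];
                  double dif_y = ay - b_ps[j*nx2+1];
                  local[i*nb+j] = sqrt(dif_x*dif_x + dif_y*dif_y);
              }
          }

          /* the "+" part of the reduction: fold the private copies
             into the shared array, one thread at a time */
          #pragma omp critical
          for (k = 0; k < na * nb; k++)
              dist[k] += local[k];

          free(local);
      }
  }

Each element of dist is computed by exactly one thread here, so the sum just moves each thread's rows into place; the point is only that the threads stay out of the shared dist while they are computing.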

Sturla