Thanks a lot, very informative. I guess what you say about the cache line being "dirtied" is related to the info I got with valgrind (see my earlier email in this thread: L1 Data Write Miss 3636). Can one assume that the cache line is always a few megabytes?
Thanks,
Sebastian

On Sat, Feb 19, 2011 at 12:40 AM, Sturla Molden <stu...@molden.no> wrote:
> Den 17.02.2011 16:31, skrev Matthieu Brucher:
>> It may also be the sizes of the chunks OMP uses. You can/should specify
>> them in the OMP pragma so that the chunk size is a multiple of the cache
>> line size or something close.
>>
>> Matthieu
>
> Also beware of "false sharing" among the threads. When one processor
> updates the array "dist" in Sebastian's code, the cache line is dirtied
> for the other processors:
>
>     #pragma omp parallel for private(j, i, ax, ay, dif_x, dif_y)
>     for (i = 0; i < na; i++) {
>         ax = a_ps[i*nx1];
>         ay = a_ps[i*nx1+1];
>         for (j = 0; j < nb; j++) {
>             dif_x = ax - b_ps[j*nx2];
>             dif_y = ay - b_ps[j*nx2+1];
>
>             /* update shared memory */
>
>             dist[2*i+j] = sqrt(dif_x*dif_x + dif_y*dif_y);
>
>             /* ... and poof, the cache is dirty */
>         }
>     }
>
> Whenever this happens, the processors must stop whatever they are doing
> to resynchronize their cache lines. "False sharing" can therefore act as
> an "invisible GIL" inside OpenMP code. The processors can appear to run
> in syrup, and there is excessive traffic on the memory bus.
>
> This is also why MPI programs often scale better than OpenMP programs,
> despite the IPC overhead.
>
> A piece of advice when working with OpenMP is to let each thread write
> to private data arrays, and to share only read-only arrays.
>
> One can e.g. use OpenMP's "reduction" clause to achieve this: initialize
> the array dist with zeros, and use reduction(+:dist) in the OpenMP pragma
> line.
>
> Sturla
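
For concreteness, here is a minimal sketch of the private-copy/reduction approach Sturla describes, assuming a compiler that supports OpenMP 4.5 or later (which allows array sections in C reduction clauses). The function name pairwise_dist is hypothetical, and dist is assumed to hold 2*na doubles, as the dist[2*i+j] indexing above implies. Each thread accumulates into its own zero-initialized private copy of dist, and OpenMP combines the copies after the loop, so the shared array (and other threads' cache lines) is not touched while the loop runs:

#include <math.h>

/* Sketch only: the function name and array sizes are assumptions
   based on the snippet quoted above. */
void pairwise_dist(const double *a_ps, const double *b_ps, double *dist,
                   int na, int nb, int nx1, int nx2)
{
    int i, j;
    double ax, ay, dif_x, dif_y;
    int ndist = 2*na;   /* assumed size of dist, per the indexing above */

    /* the "+" reduction adds the private copies onto the original
       array, so dist must start out as all zeros */
    for (i = 0; i < ndist; i++)
        dist[i] = 0.0;

    #pragma omp parallel for private(j, ax, ay, dif_x, dif_y) \
        reduction(+: dist[0:ndist])
    for (i = 0; i < na; i++) {
        ax = a_ps[i*nx1];
        ay = a_ps[i*nx1+1];
        for (j = 0; j < nb; j++) {
            dif_x = ax - b_ps[j*nx2];
            dif_y = ay - b_ps[j*nx2+1];
            /* this writes to the thread's private copy of dist,
               not to the shared array */
            dist[2*i+j] = sqrt(dif_x*dif_x + dif_y*dif_y);
        }
    }
}

Compile with OpenMP enabled (e.g. gcc -O2 -fopenmp). Since every element of dist is written by exactly one thread here, the "+" combine simply copies each computed value over a zero; an alternative for this particular loop would be a cache-line-sized schedule(static, chunk), which is closer to Matthieu's chunk-size suggestion.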