On Thu, Jun 14, 2018 at 12:34:11PM +0200, Richard Biener wrote:
> > #pragma omp atomic is now allowed inside of simd regions.
> > Tested on x86_64-linux, committed to gomp-5_0-branch.
> >
> > We will actually not vectorize it then though, so some further work will be
> > needed in the vectorizer to handle it.  Either, if we have hw atomics for both
> > the size of the scalar accesses and size of the whole vector type, the
> > accesses are adjacent and known to be aligned, we could replace it with
> > atomic on the whole vector, or emit as a small loop or unrolled loop doing
> > the extraction, scalar atomics and if needed insert result back into
> > vectors.  Richard, thoughts on that?
>
> What's the semantic of this?  Generally for non-vectorizable stmts

OpenMP already has #pragma omp ordered simd, which specifies a part of the
loop body that should not be vectorized (which we right now just implement
as forcing no vectorization), and I guess the atomics could be handled
similarly.  I.e. say for

float a[64], b[64];
int c[64], d[64], e[64];

void
foo (void)
{
  #pragma omp simd
  for (int i = 0; i < 64; ++i)
    {
      int v;
      a[i] = sqrt (b[i]);
      c[i] = a[i];
      #pragma omp atomic capture
      v = d[i] += c[i];
      e[i] = v;
    }
}

vectorize it, say with vf of 4, as:

  for (i = 0; i < 64; i += 4)
    {
      v4si v;
      *((v4sf *)&a[i]) = sqrtv4sf (*((v4sf *)&b[i]));
      *((v4si *)&c[i]) = fix_truncv4sfv4si (*((v4sf *)&a[i]));
      v4si c_ = *((v4si *)&c[i]);
      for (i_ = 0; i_ < 4; i_++) // possibly unrolled, in any case scalar
        v[i_] = __atomic_add_fetch_4 (&d[i + i_], c_[i_], 0);
      // or, if we have hw supported __atomic_compare_exchange_16 and d is known
      // to be aligned to 128-bits, we could do a 128-bit load + vector add +
      // cmpxchg.
      *((v4si *)&e[i]) = v;
    }

The semantics of atomics inside of simd should be the same as those of:

float a[64], b[64];
int c[64], d[64], e[64];

void
foo (void)
{
  #pragma omp simd
  for (int i = 0; i < 64; ++i)
    {
      int v;
      a[i] = sqrt (b[i]);
      c[i] = a[i];
      #pragma omp ordered simd
      {
        #pragma omp atomic capture
        v = d[i] += c[i];
      }
      e[i] = v;
    }
}

in that it vectorizes (if possible) the loop, except that the ordered simd
part of the loop is not vectorized, but instead iterated sequentially from
0 to vf-1.

	Jakub
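
P.S. For anyone who wants to play with the shape of that lowering outside
the compiler, here is a rough plain C sketch.  It is only an illustration
under a few assumptions, not what the vectorizer emits: the vector
statements are written as short per-lane loops instead of real v4sf/v4si
operations, sqrt is replaced by a multiplication by 0.5f so that the
example links without libm, and the names VF, init, foo_scalar,
foo_chunked and lowering_matches_reference are made up for this sketch.
Only the __atomic_add_fetch builtin with relaxed ordering corresponds to
the scalar atomics discussed above.

```c
#include <string.h>

#define N 64
#define VF 4   /* assumed vectorization factor, as in the example above */

static float a[N], b[N];
static int c[N], d[N], e[N];

/* Fill the inputs with arbitrary but reproducible values.  */
static void
init (void)
{
  for (int i = 0; i < N; ++i)
    {
      b[i] = (float) (4 * i);
      d[i] = 100 + i;
    }
}

/* Reference: the loop body executed one iteration at a time.  */
static void
foo_scalar (void)
{
  for (int i = 0; i < N; ++i)
    {
      a[i] = b[i] * 0.5f;   /* stand-in for sqrt (b[i]) */
      c[i] = (int) a[i];
      /* atomic capture: v receives the updated value of d[i] */
      int v = __atomic_add_fetch (&d[i], c[i], __ATOMIC_RELAXED);
      e[i] = v;
    }
}

/* Stand-in for the lowering: each statement handles VF lanes at once,
   except the atomic update, which runs lane by lane from 0 to VF-1.  */
static void
foo_chunked (void)
{
  for (int i = 0; i < N; i += VF)
    {
      int v[VF];
      for (int j = 0; j < VF; ++j)   /* would be one vector op */
        a[i + j] = b[i + j] * 0.5f;
      for (int j = 0; j < VF; ++j)   /* would be one vector conversion */
        c[i + j] = (int) a[i + j];
      for (int j = 0; j < VF; ++j)   /* stays scalar, possibly unrolled */
        v[j] = __atomic_add_fetch (&d[i + j], c[i + j], __ATOMIC_RELAXED);
      for (int j = 0; j < VF; ++j)   /* would be one vector store */
        e[i + j] = v[j];
    }
}

/* Returns 1 if both versions leave d[] and e[] in the same state.  */
int
lowering_matches_reference (void)
{
  int d_ref[N], e_ref[N];

  init ();
  foo_scalar ();
  memcpy (d_ref, d, sizeof d);
  memcpy (e_ref, e, sizeof e);

  init ();
  foo_chunked ();
  return memcmp (d_ref, d, sizeof d) == 0
	 && memcmp (e_ref, e, sizeof e) == 0;
}
```

Calling lowering_matches_reference () from a small main is enough to
compare the two versions on a single thread; it says nothing about
behavior under concurrent updates of d, which is what the atomics are
there for.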