https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859
--- Comment #12 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to Thorsten Kurth from comment #11)
> yes, you are right. I thought that map(tofrom:XXXX) is the default mapping
> but I might be wrong. In any case, teams is always 1. So this code is

Variables that are neither pointers nor scalars are still implicitly
map(tofrom:XXXX), scalars are implicitly firstprivate(XXXX), and pointers are
map(alloc:ptr[0:0]).

> basically just data streaming so there is no need for a detailed
> performance analysis. When I timed the code (not profiling it) the OpenMP
> 4.5 code had a tiny bit more overhead, but not significant.
> However, we might nevertheless learn from that.

What compiler options do you use? -O2 -fopenmp, -O3 -fopenmp,
-Ofast -fopenmp, something different? Which ISA choice? -march=native, -mavx2,
...?
The 10x slowdown could most likely be explained by the inner loop being
vectorized in one case and not in the other. You aren't using
#pragma omp parallel for simd, with which you would explicitly ask for
vectorization, e.g. even at -O2 -fopenmp.
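
To make the two points above concrete, here is a minimal sketch. It is not the reporter's code: the function names, the array size N and the kernels are hypothetical and only illustrate the implicit data-sharing rules and the difference between `parallel for` and `parallel for simd`.

```c
#define N 8

/* Hypothetical streaming kernel. Without "simd", whether the loop body is
   vectorized is left to the optimizer and the chosen -O level. */
void scale_plain(float *out, const float *in, float factor, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        out[i] = factor * in[i];
}

/* Same kernel, but "simd" explicitly asks for vectorization of the loop. */
void scale_simd(float *out, const float *in, float factor, int n)
{
    #pragma omp parallel for simd
    for (int i = 0; i < n; i++)
        out[i] = factor * in[i];
}

/* Implicit data-sharing on a target region with no clauses:
   - a is an array (neither pointer nor scalar): treated as map(tofrom: a)
   - factor is a scalar: treated as firstprivate(factor)
   If no offload device is available, the region runs on the host. */
float target_demo(void)
{
    float a[N];
    float factor = 2.0f;

    for (int i = 0; i < N; i++)
        a[i] = (float)i;

    #pragma omp target
    for (int i = 0; i < N; i++)
        a[i] *= factor;

    return a[N - 1];
}
```

Building this once with -O2 -fopenmp and once with -O3 -fopenmp, and adding -fopt-info-vec, should show which of the loops GCC actually vectorized in each configuration.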