https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81108
--- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC implements what is required if there is schedule(static), which is the implementation-defined schedule right now. That schedule dictates how the iterations are distributed to the different threads, and I don't see how you could get good performance with that distribution (if you have ideas, feel free to explain them here).

In order to perform well on this testcase (which doesn't look very suitable for doacross, because the computation is inexpensive and so the needed synchronization dominates the execution time), we'd have to use a different schedule, specific to this exact loop (proceed diagonally from 2, 2 to n, m, or something like that).
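
To illustrate what "proceed diagonally" could mean, here is a minimal serial sketch in C. It is not the testcase from this PR and not anything GCC does: the grid size, the recurrence a[i][j] = a[i-1][j] + a[i][j-1], and all function names are assumptions made up for the example. Indices are 0-based, so the walk starts at (1, 1) rather than (2, 2). It only shows that visiting the anti-diagonals in increasing order respects the same cross-iteration dependences as the row-major order; the cells within one diagonal do not depend on each other, which is what a loop-specific schedule would have to exploit.

```c
#include <assert.h>
#include <string.h>

#define N 6  /* hypothetical number of rows */
#define M 5  /* hypothetical number of columns */

/* Set the boundary row and column that the recurrence reads from. */
static void init_grid(long a[N][M])
{
    memset(a, 0, sizeof(long[N][M]));
    for (int i = 0; i < N; i++)
        a[i][0] = i + 1;
    for (int j = 0; j < M; j++)
        a[0][j] = j + 1;
}

/* Reference: row-major order, as the sequential loop nest would run. */
static void fill_row_major(long a[N][M])
{
    for (int i = 1; i < N; i++)
        for (int j = 1; j < M; j++)
            a[i][j] = a[i - 1][j] + a[i][j - 1];
}

/* Wavefront: visit anti-diagonals d = i + j in increasing order.
   Both inputs of cell (i, j) lie on diagonal d - 1 (or on the boundary),
   so every cell of diagonal d is independent of the others on d. */
static void fill_wavefront(long a[N][M])
{
    for (int d = 2; d <= (N - 1) + (M - 1); d++) {
        int i_lo = d - (M - 1) > 1 ? d - (M - 1) : 1;
        int i_hi = d - 1 < N - 1 ? d - 1 : N - 1;
        for (int i = i_lo; i <= i_hi; i++) {
            int j = d - i;
            assert(j >= 1 && j < M);  /* stays inside the interior */
            a[i][j] = a[i - 1][j] + a[i][j - 1];
        }
    }
}

/* Returns 1 if both traversal orders produce identical grids. */
static int orders_agree(void)
{
    long x[N][M], y[N][M];
    init_grid(x);
    init_grid(y);
    fill_row_major(x);
    fill_wavefront(y);
    return memcmp(x, y, sizeof x) == 0;
}
```

In a parallel version the inner loop over i would be the one shared among threads, with a barrier (or equivalent) between diagonals. With a body this cheap the synchronization would likely still dominate.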