https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115438
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Another difference is that for

C     Local r*r norm
      r2=0.0
      do k=2,nzl+1
        do j=1,ny
          do i=1,nx
            do l=1,nb
              r(l,i,j,k) = b(l,i,j,k-1) - r(l,i,j,k)
              r2 =r2+r(l,i,j,k)**2
              rhat(l,i,j,k) = r(l,i,j,k)
            enddo
          enddo
        enddo
      enddo

we're now ending up with hybrid SLP (SLP for the reduction and non-SLP for
the non-grouped stores).  In the end in .optimized the code looks the same
again though.  That's expected and will resolve itself.

Another difference is that without SLP we prefer to use a neutral element
as reduction init while with SLP we prefer the scalar initial values as
that's more efficient for SLP reductions and it might also reduce lifetime
of the reg holding the initial value.  I doubt this to be the reason for
the slowness, but it at least prevails.
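To make the two reduction-init choices concrete, here is a minimal scalar model in C. It is only a sketch and not what GCC actually emits: the lane count `LANES`, the function names, and the choice to put the initial value in lane 0 are assumptions made for illustration. The first function starts every lane at the neutral element (0.0 for a sum) and adds the scalar initial value after the loop. The second puts the scalar initial value into the accumulator before the loop, so it is not needed afterwards.

```c
#include <stddef.h>

#define LANES 4  /* hypothetical vector width, chosen for illustration */

/* Neutral-element init: all lanes start at 0.0 and the scalar initial
   value is folded in only after the vector loop, so it has to stay
   available until then. */
double sum_sq_neutral_init(const double *r, size_t n, double init)
{
    double acc[LANES] = {0.0, 0.0, 0.0, 0.0};
    size_t i = 0;

    /* Model of the vector loop: each lane accumulates its own partial sum. */
    for (; i + LANES <= n; i += LANES)
        for (size_t l = 0; l < LANES; l++)
            acc[l] += r[i + l] * r[i + l];

    /* Epilogue: the initial value is consumed here. */
    double sum = init;
    for (size_t l = 0; l < LANES; l++)
        sum += acc[l];

    /* Scalar tail for the leftover elements. */
    for (; i < n; i++)
        sum += r[i] * r[i];
    return sum;
}

/* Scalar-initial-value init: the initial value is placed into lane 0
   before the loop and is not referenced again afterwards. */
double sum_sq_scalar_init(const double *r, size_t n, double init)
{
    double acc[LANES] = {init, 0.0, 0.0, 0.0};
    size_t i = 0;

    for (; i + LANES <= n; i += LANES)
        for (size_t l = 0; l < LANES; l++)
            acc[l] += r[i + l] * r[i + l];

    /* Epilogue: plain horizontal sum, no extra add of the initial value. */
    double sum = 0.0;
    for (size_t l = 0; l < LANES; l++)
        sum += acc[l];

    for (; i < n; i++)
        sum += r[i] * r[i];
    return sum;
}
```

Both functions compute the same mathematical sum. With floating-point data the rounding can differ between them, because the initial value enters the summation at a different point.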