https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115438

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Another difference is that for

C       Local r*r norm
        r2=0.0
        do k=2,nzl+1
           do j=1,ny
              do i=1,nx
                 do l=1,nb
                    r(l,i,j,k) = b(l,i,j,k-1) - r(l,i,j,k)
                    r2 =r2+r(l,i,j,k)**2
                    rhat(l,i,j,k) = r(l,i,j,k)
                 enddo
              enddo
           enddo
        enddo

we now end up with hybrid SLP (SLP for the reduction and non-SLP
for the non-grouped stores).  In the .optimized dump the code ends up
looking the same again, though.

That's expected and will resolve itself.

Another difference is that without SLP we prefer to use a neutral element
as the reduction init, while with SLP we prefer the scalar initial values,
as that's more efficient for SLP reductions and might also reduce the
lifetime of the register holding the initial value.  I doubt this is
the reason for the slowness, but this difference at least persists.
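
To make the two reduction-init strategies concrete, here is a minimal
scalar C sketch that simulates a vector accumulator with a small array.
It is an illustration only and not what the vectorizer emits: LANES,
sum_sq_neutral_init and sum_sq_scalar_init are hypothetical names, and
placing the initial value in lane 0 is just one possible layout.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4 /* hypothetical vector width, chosen for illustration */

/* Neutral-element init: every lane starts at 0 (the neutral element of +).
   The scalar initial value is added only after the loop, so it has to stay
   live across the whole loop. */
static double sum_sq_neutral_init(const double *r, size_t n, double init)
{
    double acc[LANES] = {0.0, 0.0, 0.0, 0.0};
    size_t i = 0;

    for (; i + LANES <= n; i += LANES)
        for (size_t l = 0; l < LANES; ++l)
            acc[l] += r[i + l] * r[i + l];

    double sum = init; /* epilogue adjustment with the initial value */
    for (size_t l = 0; l < LANES; ++l)
        sum += acc[l];
    for (; i < n; ++i) /* scalar tail for leftover elements */
        sum += r[i] * r[i];
    return sum;
}

/* Scalar-initial-value init: the initial value is placed into the
   accumulator before the loop (lane 0 here, an assumption), so it is
   consumed up front and no adjustment is needed afterwards. */
static double sum_sq_scalar_init(const double *r, size_t n, double init)
{
    double acc[LANES] = {init, 0.0, 0.0, 0.0};
    size_t i = 0;

    for (; i + LANES <= n; i += LANES)
        for (size_t l = 0; l < LANES; ++l)
            acc[l] += r[i + l] * r[i + l];

    double sum = acc[0];
    for (size_t l = 1; l < LANES; ++l)
        sum += acc[l];
    for (; i < n; ++i) /* scalar tail for leftover elements */
        sum += r[i] * r[i];
    return sum;
}

/* Both strategies agree on inputs that are exactly representable. */
static void self_check(void)
{
    const double r[] = {1.0, 2.0, 3.0, 4.0, 5.0};
    assert(sum_sq_neutral_init(r, 5, 1.0) == sum_sq_scalar_init(r, 5, 1.0));
}
```

In the testcase above r2 starts at 0.0, which is already the neutral
element, so in this sketch the two variants differ only in whether a
post-loop add of the initial value exists at all.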
