http://gcc.gnu.org/bugzilla/show_bug.cgi?id=61000

            Bug ID: 61000
           Summary: No loop interchange for inner loop along the slow index
           Product: gcc
           Version: 4.10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: dominiq at lps dot ens.fr
                CC: grosser at gcc dot gnu.org, mircea.namolaru at inria dot fr

Graphite is unable to do the loop interchange when the inner loop is along the
slow index of an array:

[Book15] Fortran/omp_tst% cat loop.f90
module parms
  implicit none
  private
  integer, parameter, public :: num = 1024
  integer, parameter, public :: dp = kind(0.0d0)
end module parms

program loops
  use parms
  implicit none
  real(kind=dp), dimension(:, :), &
      allocatable :: a, c
  integer :: i, j, k, n_iter=100
  integer(8) :: start, finish, counts

  allocate(a(num,num),c(num,num))
  call random_number(a)

  c = 0
  call system_clock (start, counts)
  do k=1,n_iter
    do i=1,num
!      c(i,1) = 0.5*(a(i,2) - a(i,num))
!      c(i,num) = 0.5*(a(i,1) - a(i,num-1))
      do j=2,num-1
        c(i,j) = 0.5*(a(i,j+1) - a(i,j-1))
      end do
    end do
  end do
  call system_clock (finish)
  print *, sum(abs(c)) ! To ensure computation
  print *, "Elapsed time =" ,&
      (finish - start) / real(counts, kind=8), "seconds"

  c = 0
  call system_clock (start, counts)
  do k=1,n_iter
!    do i=1,num
!      c(i,1) = 0.5*(a(i,2) - a(i,num))
!      c(i,num) = 0.5*(a(i,1) - a(i,num-1))
!    end do
    do j=2,num-1
      do i=1,num
        c(i,j) = 0.5*(a(i,j+1) - a(i,j-1))
      end do
    end do
  end do
  call system_clock (finish)
  print *, sum(abs(c)) ! To ensure computation
  print *, "Elapsed time =" ,&
      (finish - start) / real(counts, kind=8), "seconds"
end program loops
[Book15] Fortran/omp_tst% gfc -Ofast -floop-interchange loop.f90
[Book15] Fortran/omp_tst% time a.out
   174350.51293227341
 Elapsed time =   2.1943769999999998      seconds
   174350.51293227341
 Elapsed time =  0.14006299999999999      seconds
2.347u 0.011s 0:02.36 99.5%    0+0k 0+0io 30pf+0w

This may be a duplicate of pr36011, but the timings are not affected by adding
-fno-tree-pre -fno-tree-loop-im.

Note that gcc with -floop-interchange is able to optimize the matrix product
(see pr14741 and pr60997).

This also affects the polyhedron test air.f90. With the following patch

--- air.f90     2009-08-28 14:22:26.000000000 +0200
+++ air_va.f90  2014-04-19 13:10:44.000000000 +0200
@@ -400,8 +400,8 @@
 !
 !     COMPUTE THE FLUX TERMS
 !
-      DO i = 1 , MXPx
-         DO j = 1 , MXPy
+      DO j = 1 , MXPy
+         DO i = 1 , MXPx
 !
 !     compute vanleer fluxes
 !
@@ -657,8 +657,8 @@
       ENDDO
 !
 !     COMPUTE THE FLUX TERMS
-      DO i = 1 , MXPx
-         DO j = 1 , MXPy
+      DO j = 1 , MXPy
+         DO i = 1 , MXPx
 !
 !     compute vanleer fluxes
 !

the execution time goes from 3.2s to 2.7s (with -Ofast, with/without
-floop-interchange).
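
For readers less familiar with Fortran array layout, here is a small sketch of what "inner loop along the slow index" means for loop.f90. It is not part of the original test: the program name stride_sketch and the helper lin_offset are made up for illustration, and it assumes the usual column-major layout of a(num,num) with num = 1024 as in the test case. It prints the byte stride between consecutive iterations of the innermost loop for each of the two loop orders.

```fortran
! Illustrative sketch only; names are hypothetical and not taken from GCC
! or from the original test case.
program stride_sketch
  implicit none
  integer, parameter :: dp = kind(0.0d0)
  integer, parameter :: num = 1024        ! same extent as in loop.f90
  integer :: elem_bytes

  ! storage_size returns the size in bits of one element of this kind.
  elem_bytes = storage_size(0.0_dp) / 8

  ! Inner loop over i: consecutive iterations touch a(i,j) and a(i+1,j).
  print *, "inner loop over i (fast index), stride in bytes:", &
       (lin_offset(2, 1) - lin_offset(1, 1)) * elem_bytes

  ! Inner loop over j: consecutive iterations touch a(i,j) and a(i,j+1).
  print *, "inner loop over j (slow index), stride in bytes:", &
       (lin_offset(1, 2) - lin_offset(1, 1)) * elem_bytes

contains

  ! Column-major layout: element (i,j) of an array of shape (num,num)
  ! sits at linear offset (i-1) + (j-1)*num, counted in elements.
  pure integer function lin_offset(i, j)
    integer, intent(in) :: i, j
    lin_offset = (i - 1) + (j - 1) * num
  end function lin_offset

end program stride_sketch
```

On a platform where real(dp) occupies 8 bytes, this gives a stride of 8 bytes when the inner loop runs over i and 8192 bytes when it runs over j. That would be consistent with the first timing in loop.f90 being much slower than the second, although the sketch itself measures nothing.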