http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18437
--- Comment #5 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-07-27
12:38:20 UTC ---
The initial testcase is probably a bad example (3x3 matrix). The following
testcase is borrowed from Polyhedron rnflow and is vectorized by ICC but
not by GCC (the ICC variant is 15% faster):

  function trs2a2 (j, k, u, d, m)
    real, dimension (1:m,1:m) :: trs2a2
    real, dimension (1:m,1:m) :: u, d
    integer, intent (in) :: j, k, m
    real (kind = selected_real_kind (10,50)) :: dtmp
    trs2a2 = 0.0
    do iclw1 = j, k - 1
      do iclw2 = j, k - 1
        dtmp = 0.0d0
        do iclww = j, k - 1
          dtmp = dtmp + u (iclw1, iclww) * d (iclww, iclw2)
        enddo
        trs2a2 (iclw1, iclw2) = dtmp
      enddo
    enddo
    return
  end function trs2a2

The reason GCC cannot vectorize this is that the load from U has a
non-constant stride: the inner loop walks the second dimension of U,
so consecutive loads are m elements apart, and m is only known at run
time. Vectorization would therefore need to load two scalars and build
up a vector (ICC does that). If the stride were constant but not a
power of two, GCC would reject that as well, probably so as not to
confuse the interleaving code. Data dependence analysis also rejects
non-constant strides.
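
To make the stride issue concrete, here is a rough C sketch of the inner
loop with Fortran's column-major indexing written out by hand. The
function names, the zero-based indices and the two-lane split are
assumptions made for illustration; this is not what GCC or ICC actually
emit. The second function mimics "load two scalars and build up a
vector". Note that it reassociates the sum, which a compiler may only do
under relaxed floating-point rules (for GCC, e.g. -ffast-math).

```c
#include <assert.h>
#include <stddef.h>

/* Element (r, c) of an m-by-m column-major array is a[r + c * m]
   (zero-based indices, unlike the Fortran original). */

/* Scalar form of the inner loop: float product, double accumulator. */
double dot_scalar(const float *u, const float *d, size_t m,
                  size_t row, size_t col, size_t lo, size_t hi)
{
    assert(m > 0);
    double acc = 0.0;
    for (size_t i = lo; i < hi; ++i) {
        /* u is read with stride m (run-time value), d with stride 1 */
        float p = u[row + i * m] * d[i + col * m];
        acc += p;
    }
    return acc;
}

/* Hypothetical two-lane version: the u operands are gathered with two
   scalar loads, the d operands are adjacent in memory. */
double dot_two_lanes(const float *u, const float *d, size_t m,
                     size_t row, size_t col, size_t lo, size_t hi)
{
    assert(m > 0);
    double lane[2] = {0.0, 0.0};
    size_t i = lo;
    for (; i + 1 < hi; i += 2) {
        float u0 = u[row + i * m];        /* scalar load 1 */
        float u1 = u[row + (i + 1) * m];  /* scalar load 2, m elements on */
        float d0 = d[i + col * m];        /* contiguous pair: could be */
        float d1 = d[i + 1 + col * m];    /* a single vector load      */
        float p0 = u0 * d0;
        float p1 = u1 * d1;
        lane[0] += p0;
        lane[1] += p1;
    }
    double acc = lane[0] + lane[1];       /* final horizontal reduction */
    for (; i < hi; ++i) {                 /* scalar tail for odd counts */
        float p = u[row + i * m] * d[i + col * m];
        acc += p;
    }
    return acc;
}
```

With small integer-valued inputs both functions return the same value;
in general the results can differ in the last bits because the order of
the additions changes.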
A further complication (for the cost model) is that the accumulator is
of type double while the data is of type float. ICC uses only
half of the float vectors here to handle mixed float/double type
loops (but it still unrolls the loop).
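
The mixed-type trade-off can be pictured with another small C sketch.
It assumes 128-bit vectors (4 float lanes, 2 double lanes) and reads
"half of the float vectors" as using half of the float lanes; both are
assumptions, as are the function names. dot_half consumes two floats per
step so that one double accumulator vector is enough. dot_full is a
hypothetical alternative that consumes all four float lanes and widens
the low and high halves into two double accumulators. Plain arrays stand
in for vector registers.

```c
#include <stddef.h>

enum { VF_FLOAT = 4, VF_DOUBLE = 2 };  /* lanes in an assumed 128-bit vector */

/* Half-vector strategy: VF_DOUBLE floats per step, one double accumulator. */
double dot_half(const float *a, const float *b, size_t n)
{
    double acc[VF_DOUBLE] = {0.0, 0.0};
    size_t i = 0;
    for (; i + VF_DOUBLE <= n; i += VF_DOUBLE) {
        for (size_t l = 0; l < VF_DOUBLE; ++l) {
            float p = a[i + l] * b[i + l];  /* float multiply */
            acc[l] += p;                    /* widened to double on add */
        }
    }
    double sum = acc[0] + acc[1];
    for (; i < n; ++i) {                    /* scalar tail */
        float p = a[i] * b[i];
        sum += p;
    }
    return sum;
}

/* Full-vector strategy: VF_FLOAT floats per step, two double accumulators
   fed from the low and high halves of the float product. */
double dot_full(const float *a, const float *b, size_t n)
{
    double lo[VF_DOUBLE] = {0.0, 0.0};
    double hi[VF_DOUBLE] = {0.0, 0.0};
    size_t i = 0;
    for (; i + VF_FLOAT <= n; i += VF_FLOAT) {
        float p[VF_FLOAT];
        for (size_t l = 0; l < VF_FLOAT; ++l)
            p[l] = a[i + l] * b[i + l];
        lo[0] += p[0];
        lo[1] += p[1];
        hi[0] += p[2];
        hi[1] += p[3];
    }
    double sum = (lo[0] + lo[1]) + (hi[0] + hi[1]);
    for (; i < n; ++i) {                    /* scalar tail */
        float p = a[i] * b[i];
        sum += p;
    }
    return sum;
}
```

Which of the two is cheaper depends on the cost of the widening
conversions relative to the multiplies, which is the kind of question
the cost model would have to answer.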