http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18437

--- Comment #5 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-07-27 12:38:20 UTC ---
The initial testcase is probably a bad example (3x3 matrix).  The following
testcase is borrowed from Polyhedron rnflow and is vectorized by ICC but
not by GCC (the ICC variant is 15% faster):

      function trs2a2 (j, k, u, d, m)
      real, dimension (1:m,1:m) :: trs2a2  
      real, dimension (1:m,1:m) :: u, d
      integer, intent (in)      :: j, k, m
      real (kind = selected_real_kind (10,50)) :: dtmp
      trs2a2 = 0.0
      do iclw1 = j, k - 1
         do iclw2 = j, k - 1
            dtmp = 0.0d0
            do iclww = j, k - 1
               dtmp = dtmp + u (iclw1, iclww) * d (iclww, iclw2)
            enddo
            trs2a2 (iclw1, iclw2) = dtmp
         enddo
      enddo
      return
      end function trs2a2

GCC cannot vectorize this because the load from U has a non-constant
stride, so vectorization would need to load two scalars and build up
a vector from them (ICC does that).  If the stride were constant but
not a power of two, GCC would reject that as well, probably so as not
to confuse the interleaving code.  Data dependence analysis also
rejects non-constant strides.
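
To make the stride issue concrete, here is a rough C rendering of the
inner loop.  It is only a sketch, not what either compiler emits: the
names IDX, dot_scalar and dot_pairs are made up for illustration,
indices are 0-based with a half-open range [j, k), and the arrays are
assumed to be stored column-major with leading dimension m, as Fortran
does.  With that layout the access to u steps by m elements per
iteration (a value known only at run time), while the access to d is
contiguous.  dot_pairs builds each two-lane "vector" of u from two
scalar loads.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major element (i, j), 0-based, leading dimension m
   (assumed layout, mirroring a Fortran (1:m,1:m) array). */
#define IDX(i, j, m) ((size_t)(i) + (size_t)(j) * (size_t)(m))

/* Scalar form of the inner loop: u is read with stride m,
   d is read with stride 1. */
double dot_scalar(const float *u, const float *d, int m,
                  int i1, int i2, int j, int k)
{
    double dtmp = 0.0;
    for (int w = j; w < k; ++w)
        dtmp += u[IDX(i1, w, m)] * d[IDX(w, i2, m)];
    return dtmp;
}

/* Two-lane version: the u lanes come from two separate scalar loads
   that are m elements apart; the d lanes are adjacent in memory. */
double dot_pairs(const float *u, const float *d, int m,
                 int i1, int i2, int j, int k)
{
    assert(m > 0);
    double acc0 = 0.0, acc1 = 0.0;
    int w = j;
    for (; w + 1 < k; w += 2) {
        float u0 = u[IDX(i1, w, m)];        /* scalar load, lane 0 */
        float u1 = u[IDX(i1, w + 1, m)];    /* scalar load, lane 1 */
        const float *dp = &d[IDX(w, i2, m)];/* contiguous pair     */
        acc0 += u0 * dp[0];
        acc1 += u1 * dp[1];
    }
    double dtmp = acc0 + acc1;              /* combine the lanes   */
    for (; w < k; ++w)                      /* odd trip count tail */
        dtmp += u[IDX(i1, w, m)] * d[IDX(w, i2, m)];
    return dtmp;
}
```

Note that dot_pairs sums the terms in a different order than
dot_scalar, so the two can differ in the last bits for general
floating-point input.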

A further complication (for the cost model) is that the accumulator
has type double while the data are of type float.  ICC uses only
half of each float vector here to handle mixed float/double
loops (but it still unrolls the loop).
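
The width mismatch can be sketched in plain C as follows.  This is an
illustration under stated assumptions, not ICC's actual code: it
assumes 128-bit vectors, where a float vector has 4 lanes and a double
vector only 2, and the function name dot_widen2 is hypothetical.  Each
step multiplies 2 floats (half of a float vector), widens the products
to double, and adds them into 2 double lanes (one full double vector).

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of two contiguous float arrays with a double
   accumulator, processed two elements at a time.  The product is
   formed in float and then widened, as in the Fortran source where
   both operands are default real. */
double dot_widen2(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double acc[2] = {0.0, 0.0};         /* one 2-lane double vector */
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        float p0 = a[i] * b[i];         /* only 2 of the 4 float    */
        float p1 = a[i + 1] * b[i + 1]; /* lanes would be in use    */
        acc[0] += (double)p0;           /* widen float -> double    */
        acc[1] += (double)p1;
    }
    double sum = acc[0] + acc[1];
    for (; i < n; ++i)                  /* tail for odd n           */
        sum += (double)(a[i] * b[i]);
    return sum;
}
```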
