[Bug tree-optimization/56741] New: Why not to perform 128-bit vector iteration when vectorizing loop by 256-bit

kirill.yukhin at intel dot com Tue, 26 Mar 2013 07:10:46 -0700


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56741




             Bug #: 56741

           Summary: Why not to perform 128-bit vector iteration when

                    vectorizing loop by 256-bit

    Classification: Unclassified

           Product: gcc

           Version: 4.9.0

            Status: UNCONFIRMED

          Severity: normal

          Priority: P3

         Component: tree-optimization

        AssignedTo: unassig...@gcc.gnu.org

        ReportedBy: kirill.yuk...@intel.com





Created attachment 29730

  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=29730

Reproducer



Hi guys,

Suppose we vectorize a loop with AVX[2].

E.g.:

do i=0..N-1, ++i

  stmt [i];

enddo



If vectorization is allowed and possible, we'll get something like:



rem = N % VL /* VL is vector length.  */

/* Vectorized loop.  */

do i=0..N-rem-1, i+=VL

  v_stmt [i..i+VL-1];

enddo



/* Remainder.  */

do j=0..rem-1, ++j

  stmt [j+i];

enddo
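
For reference, the loop shape above can be sketched in plain C. This is only a model under stated assumptions: `stmt` is a hypothetical stand-in for the loop body, `run_vectorized` is an illustrative name, and the inner lane loop merely imitates one vector statement. It is not what the vectorizer actually emits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the loop body `stmt [i]`.  */
static void stmt(float *a, size_t i) { a[i] += 1.0f; }

/* Runs the loop in the "vector loop + scalar remainder" shape and
   returns the number of scalar remainder iterations executed.
   vl is the number of lanes per vector (VL in the pseudocode).  */
size_t run_vectorized(float *a, size_t n, size_t vl)
{
    assert(vl > 0);
    size_t rem = n % vl;
    size_t i = 0;

    /* "Vectorized" loop: the inner loop models one v_stmt over vl lanes.  */
    for (; i < n - rem; i += vl)
        for (size_t lane = 0; lane < vl; ++lane)
            stmt(a, i + lane);

    /* Scalar remainder: exactly rem iterations, starting where the
       vector loop stopped.  */
    for (size_t j = 0; j < rem; ++j)
        stmt(a, i + j);

    return rem;
}
```

With 15 floats, calling it with `vl = 4` leaves 3 scalar iterations, while `vl = 8` leaves 7.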



The remainder may be unrolled, if allowed.



For 128-bit vectors, the remainder loop runs at most 3 iterations for floats

and 1 for doubles.



For 256-bit vectors, the maximum is 7 and 3 iterations, respectively.
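
These bounds follow from N % VL being at most VL - 1. A small sketch of the arithmetic, with helper names (`lanes`, `max_remainder`) that are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Lanes per vector: vector width in bits divided by element width in bits.  */
size_t lanes(size_t vector_bits, size_t elem_bytes)
{
    assert(elem_bytes > 0 && vector_bits % (8 * elem_bytes) == 0);
    return vector_bits / (8 * elem_bytes);
}

/* Worst-case number of scalar remainder iterations: VL - 1.  */
size_t max_remainder(size_t vector_bits, size_t elem_bytes)
{
    return lanes(vector_bits, elem_bytes) - 1;
}
```

Assuming 4-byte floats and 8-byte doubles, `max_remainder(128, 4)` is 3 and `max_remainder(256, 8)` is 3, while `max_remainder(256, 4)` is 7.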



The attached test shows about a 30% increase in instruction count because of

the higher maximum iteration count of the loop remainder.



For AVX[2], why not add one iteration on 128-bit registers, so that at most

3 (float) or 1 (double) iterations are left in the remainder?



Like this (necessary checks are omitted):



rem_1 = N % VL1 /* VL1 is widest vector length - 256-bit.  */

/* Vectorized loop.  */

do i=0..N-rem_1-1, i+=VL1

  v1_stmt[i..i+VL1-1]; /* Vectorized with 256-bit vector.  */

enddo



/* Additional iteration.  VL2 is narrow vector length - 128-bit.  */

v2_stmt [i..i+VL2-1]; /* Vectorized with 128-bit vector.  */

i += VL2; /* Skip the elements just processed.  */

rem_2 = rem_1-VL2;



/* Remainder.  */

do j=0..rem_2-1, ++j

  stmt[j+i];

enddo
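
A plain C model of the proposed scheme, including the check the pseudocode omits, could look like the following. It is a sketch under assumptions, not a proposal for the actual implementation: `stmt`, `v_stmt` and `run_two_width` are hypothetical names, vectors are imitated by lane loops, and the wide vector is assumed to be exactly twice the narrow one (as with 256-bit and 128-bit registers).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the loop body `stmt [i]`.  */
static void stmt(float *a, size_t i) { a[i] += 1.0f; }

/* Models one vector statement over lanes [i, i + vl).  */
static void v_stmt(float *a, size_t i, size_t vl)
{
    for (size_t lane = 0; lane < vl; ++lane)
        stmt(a, i + lane);
}

/* Wide vector loop, at most one narrow vector iteration, then a scalar
   remainder.  Returns the number of scalar iterations executed.  */
size_t run_two_width(float *a, size_t n, size_t vl1, size_t vl2)
{
    assert(vl2 > 0 && vl1 == 2 * vl2);  /* assumption: VL1 = 2 * VL2 */
    size_t rem = n % vl1;
    size_t i = 0;

    /* Main loop on the wide vector (256-bit in the report).  */
    for (; i < n - rem; i += vl1)
        v_stmt(a, i, vl1);

    /* Additional iteration on the narrow vector (128-bit), taken only
       when enough elements remain.  */
    if (rem >= vl2) {
        v_stmt(a, i, vl2);
        i += vl2;
        rem -= vl2;
    }

    /* Scalar remainder: now at most vl2 - 1 iterations.  */
    for (size_t j = 0; j < rem; ++j)
        stmt(a, i + j);

    return rem;
}
```

For 15 floats with `vl1 = 8` and `vl2 = 4`, the scalar remainder drops from 7 iterations to 3; for 10 floats the narrow step is skipped and 2 scalar iterations remain.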





Here is how to reproduce:

$ gcc -static -m64 -fstrict-aliasing -fno-prefetch-loop-arrays -Ofast \
    -funroll-loops -fwhole-program -msse4 ./loop_vers.c -o loop_sse



$ gcc -static -m64 -fstrict-aliasing -fno-prefetch-loop-arrays -Ofast \
    -funroll-loops -fwhole-program -mavx ./loop_vers.c -o loop_avx



$ sde -icount -- ./loop_sse 7

0.000000$$ TID: 0 ICOUNT: 16001317



$ sde -icount -- ./loop_avx 7

0.000000$$ TID: 0 ICOUNT: 20847322
