http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56741
Bug #: 56741
Summary: Why not to perform 128-bit vector iteration when
vectorizing loop by 256-bit
Classification: Unclassified
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
AssignedTo: [email protected]
ReportedBy: [email protected]
Created attachment 29730
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=29730
Reproducer
Hi guys,
Suppose we vectorize a loop with AVX[2].
E.g.:
do i=0..N-1, ++i
stmt [i];
enddo
If vectorization is allowed and possible, we'll get something like:
rem = N % VL /* VL is the vector length. */
/* Vectorized loop. */
do i=0..N-rem-1, i+=VL
v_stmt [i..i+VL-1];
enddo
/* Remainder. */
do j=0..rem-1, ++j
stmt [j+i];
enddo
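The scheme above can be sketched in plain C. This is only an illustration, not what the vectorizer emits: the loop body (`a[i] = b[i] + c[i]`), the function name `add_strip_mined`, and the choice of `VL = 8` (eight floats per 256-bit vector) are assumptions made for the example, and the inner fixed-length loop stands in for a single vector statement.

```c
#include <stddef.h>

enum { VL = 8 }; /* assumed: 8 floats fit in one 256-bit vector */

/* Hypothetical loop body: a[i] = b[i] + c[i]. */
void add_strip_mined(float *a, const float *b, const float *c, size_t n)
{
    size_t rem = n % VL; /* iterations left for the scalar remainder */
    size_t i = 0;

    /* "Vectorized" loop: each pass handles VL consecutive elements.
       The inner loop models one vector statement. */
    for (; i < n - rem; i += VL)
        for (size_t k = 0; k < VL; ++k)
            a[i + k] = b[i + k] + c[i + k];

    /* Scalar remainder: at most VL - 1 iterations. */
    for (size_t j = 0; j < rem; ++j)
        a[i + j] = b[i + j] + c[i + j];
}
```

With `n = 11`, the main loop covers elements 0..7 and the remainder handles the last three one at a time.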
The remainder may be unrolled, if allowed.
For 128-bit vectors, the remainder runs at most 3 iterations for floats and 1
for doubles.
For 256-bit vectors, these maximums are 7 and 3, respectively.
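These maximums follow from rem = N % VL being at most VL - 1. A small C sketch of the arithmetic, with hypothetical helper names `vector_lanes` and `max_remainder`, and assuming 4-byte floats and 8-byte doubles:

```c
#include <stddef.h>

/* Number of elements of elem_bytes bytes that fit in a vector_bits-wide vector. */
size_t vector_lanes(size_t vector_bits, size_t elem_bytes)
{
    return vector_bits / (8 * elem_bytes);
}

/* Worst-case scalar remainder: N % VL can be at most VL - 1. */
size_t max_remainder(size_t vector_bits, size_t elem_bytes)
{
    return vector_lanes(vector_bits, elem_bytes) - 1;
}
```

For example, `max_remainder(256, 4)` gives 7 for floats and `max_remainder(256, 8)` gives 3 for doubles.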
The attached test shows a 30% increase in instruction count, caused by the
larger maximum iteration count of the loop remainder.
For AVX[2], why not add one extra iteration on 128-bit registers, leaving at
most 3 and 1 iterations in the remainder?
Like this (necessary checks are omitted):
rem_1 = N % VL1 /* VL1 is the widest vector length: 256-bit. */
/* Vectorized loop. */
do i=0..N-rem_1-1, i+=VL1
v1_stmt [i..i+VL1-1]; /* Vectorized with 256-bit vectors. */
enddo
/* Additional iteration. */
v2_stmt [i..i+VL2-1]; /* Vectorized with a 128-bit vector. */
i += VL2;
rem_2 = rem_1-VL2; /* VL2 is the narrow vector length: 128-bit. */
/* Remainder. */
do j=0..rem_2-1, ++j
stmt [j+i];
enddo
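A C sketch of the proposed scheme, including the check the pseudocode omits (the 128-bit step runs only when at least VL2 elements are left). As before, the loop body, the name `add_two_level`, and the lane counts `VL1 = 8` and `VL2 = 4` (floats) are assumptions for illustration, and the inner fixed-length loops stand in for vector statements. The function returns the number of scalar remainder iterations so the effect is easy to observe.

```c
#include <stddef.h>

enum { VL1 = 8, VL2 = 4 }; /* assumed: floats per 256-bit and 128-bit vector */

/* Hypothetical loop body: a[i] = b[i] + c[i].
   Returns how many iterations fell through to the scalar remainder. */
size_t add_two_level(float *a, const float *b, const float *c, size_t n)
{
    size_t i = 0;

    /* Main loop: models 256-bit vector statements. */
    for (; i + VL1 <= n; i += VL1)
        for (size_t k = 0; k < VL1; ++k)
            a[i + k] = b[i + k] + c[i + k];

    /* Additional iteration: models one 128-bit vector statement,
       taken only if at least VL2 elements remain. */
    if (n - i >= VL2) {
        for (size_t k = 0; k < VL2; ++k)
            a[i + k] = b[i + k] + c[i + k];
        i += VL2;
    }

    /* Scalar remainder: now at most VL2 - 1 iterations. */
    size_t scalar_iters = n - i;
    for (; i < n; ++i)
        a[i] = b[i] + c[i];
    return scalar_iters;
}
```

With `n = 15`, one 256-bit pass covers 8 elements, the 128-bit step covers 4 more, and only 3 elements reach the scalar loop instead of 7.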
Here is how to reproduce:
$ gcc -static -m64 -fstrict-aliasing -fno-prefetch-loop-arrays -Ofast \
  -funroll-loops -fwhole-program -msse4 ./loop_vers.c -o loop_sse
$ gcc -static -m64 -fstrict-aliasing -fno-prefetch-loop-arrays -Ofast \
  -funroll-loops -fwhole-program -mavx ./loop_vers.c -o loop_avx
$ sde -icount -- ./loop_sse 7
0.000000$$ TID: 0 ICOUNT: 16001317
$ sde -icount -- ./loop_avx 7
0.000000$$ TID: 0 ICOUNT: 20847322
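The relative difference between the two instruction counts above can be checked with a few lines of C. The helper name `relative_increase_percent` is made up for this sketch. For the counts shown, it works out to about 30.3%.

```c
/* Percentage by which `other` exceeds `base`. */
double relative_increase_percent(unsigned long long base,
                                 unsigned long long other)
{
    return 100.0 * ((double)other - (double)base) / (double)base;
}
```

Here `base` is the SSE count (16001317) and `other` is the AVX count (20847322).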