http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56741
Bug #: 56741
Summary: Why not to perform 128-bit vector iteration when vectorizing loop by 256-bit
Classification: Unclassified
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: kirill.yuk...@intel.com

Created attachment 29730
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=29730
Reproducer

Hi guys,

Suppose we vectorize a loop with AVX[2]. E.g.:

    do i=0..N-1, ++i
      stmt [i];
    enddo

If vectorization is allowed and possible, we'll have something like:

    rem = N % VL   /* VL is vector length. */

    /* Vectorized loop. */
    do i=0..N-rem-1, i+=VL
      v_stmt [i..i+VL-1];
    enddo

    /* Remainder. */
    do j=0..rem-1, ++j
      stmt [j+i];
    enddo

The remainder may be unrolled, if allowed.

With 128-bit vectors, the remainder runs at most 3 iterations for floats and 1 for doubles. With 256-bit vectors, the maximums are 7 and 3 respectively.

The attached test shows a 30% increase in instruction count because of the larger maximum iteration count of the loop remainder.

Why not, for AVX[2], add one iteration on 128-bit registers, leaving at most 3 and 1 iterations in the remainder? Like this (necessary checks are omitted):

    rem_1 = N % VL1   /* VL1 is widest vector length - 256-bit. */

    /* Vectorized loop. */
    do i=0..N-rem_1-1, i+=VL1
      v1_stmt [i..i+VL1-1];   /* Vectorized with 256-bit vector. */
    enddo

    /* Additional iteration. */
    v2_stmt [i..i+VL2-1];     /* Vectorized with 128-bit vector. */
    i += VL2;
    rem_2 = rem_1-VL2;        /* VL2 is narrow vector length - 128-bit. */

    /* Remainder. */
    do j=0..rem_2-1, ++j
      stmt [j+i];
    enddo

Here is how to reproduce:

    $ gcc -static -m64 -fstrict-aliasing -fno-prefetch-loop-arrays -Ofast -funroll-loops -fwhole-program -msse4 ./loop_vers.c -o loop_sse
    $ gcc -static -m64 -fstrict-aliasing -fno-prefetch-loop-arrays -Ofast -funroll-loops -fwhole-program -mavx ./loop_vers.c -o loop_avx
    $ sde -icount -- ./loop_sse 7
    0.000000$$ TID: 0 ICOUNT: 16001317
    $ sde -icount -- ./loop_avx 7
    0.000000$$ TID: 0 ICOUNT: 20847322
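For illustration only, here is a rough C sketch of the loop shape proposed above, for floats (VL1 = 8 elements per 256-bit vector, VL2 = 4 per 128-bit vector). The kernel `a[i] += b[i]` and the names `add_two_level`, `scalar_tail_single` and `scalar_tail_two_level` are hypothetical and are not taken from the attached reproducer. The fixed-trip inner loops only stand in for the 256-bit and 128-bit vector statements; this is not what the vectorizer emits. The one check the pseudocode omits (whether a full 128-bit chunk is left) is included.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed element counts for floats: 256-bit and 128-bit vectors. */
enum { VL1 = 8, VL2 = 4 };

/* Scalar remainder iterations with a single 256-bit vector length. */
size_t scalar_tail_single(size_t n)
{
    return n % VL1;                 /* at most VL1 - 1 = 7 */
}

/* Scalar remainder iterations when one 128-bit step is added. */
size_t scalar_tail_two_level(size_t n)
{
    return (n % VL1) % VL2;         /* at most VL2 - 1 = 3 */
}

/* Hypothetical kernel a[i] += b[i] using the proposed loop structure.
   Returns the number of scalar remainder iterations executed. */
size_t add_two_level(float *a, const float *b, size_t n)
{
    size_t i = 0;
    size_t scalar_iters = 0;

    /* Main loop: each pass stands in for one 256-bit vector statement. */
    for (; i + VL1 <= n; i += VL1)
        for (size_t k = 0; k < VL1; ++k)
            a[i + k] += b[i + k];

    /* Additional iteration: at most one 128-bit vector statement. */
    if (i + VL2 <= n) {
        for (size_t k = 0; k < VL2; ++k)
            a[i + k] += b[i + k];
        i += VL2;
    }

    /* Scalar remainder. */
    for (; i < n; ++i, ++scalar_iters)
        a[i] += b[i];

    return scalar_iters;
}
```

For N = 15, the plain 256-bit scheme leaves 7 scalar iterations, while the sketch runs one 256-bit step, one 128-bit step and 3 scalar iterations.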