Thank you for the reply! Regarding the last part of your message: this is also what clang does when you pass a vf of 4 (with the pragma from my first message) for a loop operating on chars with SSE2. It does meaningful work on only 4 chars per iteration (a[0], zero, zero, zero, a[1], zero, zero, zero, etc.).
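To make the case concrete, here is a minimal sketch of the kind of loop I mean. The function name, signature and the `__clang__` guard are made up for illustration; this is not the exact code behind the godbolt link.

```c
#include <stddef.h>

/* Hypothetical example: add_chars and its parameters are illustrative only. */
void add_chars(unsigned char *restrict dst,
               const unsigned char *restrict a,
               const unsigned char *restrict b,
               size_t n)
{
#ifdef __clang__
    /* Request a vectorization factor of 4 for a loop over chars.
       An SSE2 xmm register holds 16 chars, so a vf of 4 leaves most
       of each register without meaningful data. */
#pragma clang loop vectorize(enable) vectorize_width(4)
#endif
    for (size_t i = 0; i < n; ++i)
        dst[i] = (unsigned char)(a[i] + b[i]);
}
```

The pragma is guarded so that gcc simply compiles the plain loop; the result is the same either way, only the generated code differs.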
Please see example here:
https://godbolt.org/g/3LAqZw

Let's say that I know all possible trip counts for my inner loop. None
of them exceeds 32. In the example above the vf for this loop is 32.
There is a runtime check, such that if the trip count is less than 32
it falls back to the scalar version.

Since the trip count is always lower than 32, the scalar version is
always chosen at runtime. But theoretically, using SSE2 for a trip
count of 8, it could use half of an xmm register (8 chars) to do
meaningful work.

Is the gcc vectorizer capable of doing this? If yes, can I somehow
achieve this in gcc by tweaking the code or adding some pragma?

On 19/10/2017, Jakub Jelinek <ja...@redhat.com> wrote:
> On Thu, Oct 19, 2017 at 10:38:28AM +0200, Richard Biener wrote:
>> On Thu, Oct 19, 2017 at 9:22 AM, Denis Bakhvalov <dendib...@gmail.com>
>> wrote:
>> > Hello!
>> >
>> > I have a hot inner loop which was vectorized by gcc, but I also want
>> > compiler to unroll this loop by some factor.
>> > It can be controled in clang with this pragma:
>> > #pragma clang loop vectorize(enable) vectorize_width(8)
>> > Please see example here:
>> > https://godbolt.org/g/UJoUJn
>> >
>> > So I want to tell gcc something like this:
>> > "I want you to vectorize the loop. After that I want you to unroll
>> > this vectorized loop by some defined factor."
>> >
>> > I was playing with #pragma omp simd with the safelen clause, and
>> > #pragma GCC optimize("unroll-loops") with no success. Compiler option
>> > -fmax-unroll-times is not suitable for me, because it will affect
>> > other parts of the code.
>> >
>> > Is it possible to achieve this somehow?
>>
>> No.
>
> #pragma omp simd has simdlen clause which is a hint on the preferable
> vectorization factor, but the vectorizer doesn't use it so far;
> probably it wouldn't be that hard to at least use that as the starting
> factor if the target has multiple ones if it is one of those.
> The vectorizer has some support for using wider vectorization factors
> if there are mixed width types within the same loop, so perhaps
> supporting 2x/4x/8x etc. sizes of the normally chosen width might not be
> that hard.
> What we don't have right now is support for using smaller
> vectorization factors, which might be sometimes beneficial for -O2
> vectorization of mixed width type loops. We always use the vf derived
> from the smallest width type, say when using SSE2 and there is a char type,
> we try to use vf of 16 and if there is also int type, do operations on
> those in 4x as many instructions, while there is also an option to use
> vf of 4 and for operations on char just do something meaningful only in 1/4
> of vector elements. The various x86 vector ISAs have instructions to
> widen or narrow for conversions.
>
> In any case, no is the right answer right now, we don't have that
> implemented.
>
>       Jakub
>

--
Best regards,
Denis.