How to force gcc to vectorize the loop with particular vectorization width
Hello!

I have a hot inner loop that gcc already vectorizes, but I also want the compiler to unroll this loop by some factor. In clang this can be controlled with the following pragma:

    #pragma clang loop vectorize(enable) vectorize_width(8)

Please see the example here: https://godbolt.org/g/UJoUJn

So I want to tell gcc something like this: "I want you to vectorize the loop. After that I want you to unroll this vectorized loop by some defined factor."

I tried #pragma omp simd with the safelen clause, and #pragma GCC optimize("unroll-loops"), with no success. The compiler option -fmax-unroll-times is not suitable for me, because it would affect other parts of the code.

Is it possible to achieve this somehow?

--
Best regards,
Denis.
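For readers without access to the godbolt link, here is a minimal sketch of the kind of loop being discussed. The function name `add_bytes` and the loop body are hypothetical (the original example is not reproduced in the message); only the two pragmas are taken from the text above. Under gcc the `omp simd` pragma takes effect only with -fopenmp or -fopenmp-simd, and `safelen(8)` merely bounds how far apart concurrently executed iterations may be; it does not request a width.

```c
#include <stddef.h>

/* Hypothetical hot loop: element-wise sum of two byte arrays. */
void add_bytes(unsigned char *restrict dst,
               const unsigned char *restrict a,
               const unsigned char *restrict b, size_t n)
{
#if defined(__clang__)
    /* clang: ask for vectorization with a vectorization factor of 8. */
#pragma clang loop vectorize(enable) vectorize_width(8)
#else
    /* gcc: the attempt described in the message. safelen(8) is a safety
       bound on concurrent iterations, not a width request. Ignored
       unless compiled with -fopenmp or -fopenmp-simd. */
#pragma omp simd safelen(8)
#endif
    for (size_t i = 0; i < n; ++i)
        dst[i] = (unsigned char)(a[i] + b[i]);
}
```

Comparing the generated assembly for both compilers (for example at -O3) is the practical way to see which vectorization factor was actually chosen.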
Re: How to force gcc to vectorize the loop with particular vectorization width
Thank you for the reply!

Regarding the last part of your message: this is also what clang does when you pass a vf of 4 (with the pragma from my first message) for a loop operating on chars with SSE2. It does meaningful work on only 4 chars per iteration (a[0], zero, zero, zero, a[1], zero, zero, zero, etc.).

Please see the example here: https://godbolt.org/g/3LAqZw

Let's say that I know all possible trip counts for my inner loop, and none of them exceeds 32. In the example above the vf for this loop is 32. There is a runtime check such that if the trip count does not exceed 32, it falls back to the scalar version.

As long as the trip count is always lower than 32, the scalar version is always chosen at runtime. But theoretically, using SSE2 with a trip count of 8, it could use half of an xmm register (8 chars) to do meaningful work.

Is the gcc vectorizer capable of doing this? If yes, can I somehow achieve this in gcc by tweaking the code or adding some pragma?

On 19/10/2017, Jakub Jelinek wrote:
> On Thu, Oct 19, 2017 at 10:38:28AM +0200, Richard Biener wrote:
>> On Thu, Oct 19, 2017 at 9:22 AM, Denis Bakhvalov wrote:
>> > Is it possible to achieve this somehow?
>>
>> No.
>
> #pragma omp simd has simdlen clause which is a hint on the preferable
> vectorization factor, but the vectorizer doesn't use it so far;
> probably it wouldn't be that hard to at least use that as the starting
> factor if the target has multiple ones if it is one of those.
> The vectorizer has some support for using wider vectorization factors
> if there are mixed width types within the same loop, so perhaps
> supporting 2x/4x/8x etc. sizes of the normally chosen width might not be
> that hard.
> What we don't have right now is support for using smaller
> vectorization factors, which might be sometimes beneficial for -O2
> vectorization of mixed width type loops. We always use the vf derived
> from the smallest width type, say when using SSE2 and there is a char type,
> we try to use vf of 16 and if there is also int type, do operations on those
> in 4x as many instructions, while there is also an option to use
> vf of 4 and for operations on char just do something meaningful only in 1/4
> of vector elements. The various x86 vector ISAs have instructions to
> widen or narrow for conversions.
>
> In any case, no is the right answer right now, we don't have that
> implemented.
>
> Jakub

--
Best regards,
Denis.
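The mixed-width case Jakub describes can be made concrete with a small sketch. The function `widen_add` is a hypothetical example, not code from the thread; the `simdlen(8)` clause is the hint he mentions, which, according to his reply, the vectorizer did not use at the time.

```c
#include <stddef.h>

/* Hypothetical mixed-width loop: reads 1-byte elements and
   accumulates them into 4-byte elements. */
void widen_add(int *restrict out, const unsigned char *restrict in, size_t n)
{
    /* simdlen(8) is only a hint at the preferred vectorization factor.
       Per the reply above, gcc instead derives the vf from the narrowest
       type (16 for char with SSE2) and handles the int operations with
       4x as many instructions. The pragma is ignored unless compiled
       with -fopenmp or -fopenmp-simd. */
#pragma omp simd simdlen(8)
    for (size_t i = 0; i < n; ++i)
        out[i] += in[i];
}
```

The alternative he mentions, a vf of 4 with only a quarter of each char vector doing useful work, corresponds to what the clang example linked above produces.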
Re: How to force gcc to vectorize the loop with particular vectorization width
Hello Richard,

Thank you. I achieved vectorization with vf = 16, using

    #pragma GCC optimize ("no-unroll-loops")
    __attribute__ ((__target__ ("sse4.2")))

and the options -march=core-avx2 -mprefer-avx128.

But now I have a question: is it possible in gcc to have vectorization with vf < 16?

On 20/10/2017, Richard Biener wrote:
> On Fri, Oct 20, 2017 at 12:13 PM, Denis Bakhvalov wrote:
>> Is gcc vectorizer capable of doing this?
>> If yes, can I somehow achieve this in gcc by tweaking the code or
>> adding some pragma?
>
> The closest is to use -mprefer-avx128 so you get SSE rather than AVX
> vector sizes. Eventually this option is among the valid target attributes
> for #pragma GCC target

--
Best regards,
Denis.
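As a rough sketch of how the combination reported in this message might be applied to a single function: the function `xor_bytes`, the `TARGET_SSE42` macro, and the use of push_options/pop_options to limit the pragma's scope are assumptions added for illustration; only the optimize pragma and the target attribute come from the message. Whether this yields vf = 16 depends on the compiler version and on the command-line options mentioned above.

```c
#include <stddef.h>

/* Scope the optimize pragma to this region so other code is unaffected
   (gcc only; the push/pop wrapping is an assumption for illustration). */
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC push_options
#pragma GCC optimize ("no-unroll-loops")
#endif

/* Apply the target attribute only on x86, where "sse4.2" is meaningful. */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define TARGET_SSE42 __attribute__((__target__("sse4.2")))
#else
#define TARGET_SSE42
#endif

/* Hypothetical hot loop over chars. */
TARGET_SSE42
void xor_bytes(unsigned char *restrict dst,
               const unsigned char *restrict src,
               unsigned char key, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = (unsigned char)(src[i] ^ key);
}

#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC pop_options
#endif
```

Building with something like -O3 -march=core-avx2 -mprefer-avx128 and inspecting the assembly would show whether xmm (128-bit) registers are used for the loop.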