Thank you for the reply!

Regarding the last part of your message: this is also what clang does
when you pass a vf of 4 (with the pragma from my first message) for a
loop operating on chars with SSE2. It does meaningful work on only 4
chars per iteration (a[0], zero, zero, zero, a[1], zero, zero, zero,
etc.).
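
A minimal sketch of the kind of loop I mean. This is not the code
behind the godbolt link; the function name add_chars_vf4 and the
assert on n are made up for illustration. The pragma is guarded so
that compilers other than clang simply see a plain loop:

```c
#include <assert.h>

/* Hypothetical example: element-wise add on chars.
   Under clang the pragma requests a vectorization factor of 4, so each
   vector iteration handles only 4 chars even though an SSE2 register
   has room for 16. */
void add_chars_vf4(unsigned char *a, const unsigned char *b, int n)
{
    assert(n >= 0);
#ifdef __clang__
#pragma clang loop vectorize(enable) vectorize_width(4)
#endif
    for (int i = 0; i < n; ++i)
        a[i] = (unsigned char)(a[i] + b[i]);  /* wraps modulo 256 */
}
```

The result is the same with or without the pragma; only the generated
code differs.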

Please see example here:
https://godbolt.org/g/3LAqZw

Let's say that I know all possible trip counts for my inner loop, and
none of them exceeds 32. In the example above the vf for this loop is 32.
There is a runtime check, so that if the trip count does not exceed 32
the code falls back to the scalar version.

As long as the trip count is always lower than 32, the scalar version
is always chosen at runtime.
But theoretically, with SSE2 and a trip count of 8, it could use half of
an xmm register (8 chars) to do meaningful work.
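
To make the runtime check concrete, here is a hand-written model of the
versioned loop. It is only a sketch under my own assumptions: the name
add_chars_versioned, the return value and the constant VF are
hypothetical, and the inner loop merely stands in for the real vector
instructions. It shows that no trip count below VF ever reaches the
"vector" body:

```c
#include <assert.h>
#include <stddef.h>

enum { VF = 32 };  /* vectorization factor assumed for this sketch */

/* Returns 1 if the block standing in for the vector body ran at least
   once, 0 if only the scalar loop did the work. */
int add_chars_versioned(unsigned char *a, const unsigned char *b, size_t n)
{
    size_t i = 0;
    int used_vector_path = 0;

    assert(a != NULL && b != NULL);

    /* Stand-in for the vectorized body: VF chars per iteration. */
    for (; i + VF <= n; i += VF) {
        for (size_t j = 0; j < VF; ++j)
            a[i + j] = (unsigned char)(a[i + j] + b[i + j]);
        used_vector_path = 1;
    }

    /* Scalar fallback: handles everything when n < VF. */
    for (; i < n; ++i)
        a[i] = (unsigned char)(a[i] + b[i]);

    return used_vector_path;
}
```

With n = 8 the function returns 0, i.e. all the work is done by the
scalar loop, which is the situation I would like to avoid.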

Is the gcc vectorizer capable of doing this?
If yes, can I achieve this in gcc by tweaking the code or adding some
pragma?

On 19/10/2017, Jakub Jelinek <ja...@redhat.com> wrote:
> On Thu, Oct 19, 2017 at 10:38:28AM +0200, Richard Biener wrote:
>> On Thu, Oct 19, 2017 at 9:22 AM, Denis Bakhvalov <dendib...@gmail.com>
>> wrote:
>> > Hello!
>> >
>> > I have a hot inner loop which was vectorized by gcc, but I also want
>> > compiler to unroll this loop by some factor.
>> > It can be controlled in clang with this pragma:
>> > #pragma clang loop vectorize(enable) vectorize_width(8)
>> > Please see example here:
>> > https://godbolt.org/g/UJoUJn
>> >
>> > So I want to tell gcc something like this:
>> > "I want you to vectorize the loop. After that I want you to unroll
>> > this vectorized loop by some defined factor."
>> >
>> > I was playing with #pragma omp simd with the safelen clause, and
>> > #pragma GCC optimize("unroll-loops") with no success. Compiler option
>> > -fmax-unroll-times is not suitable for me, because it will affect
>> > other parts of the code.
>> >
>> > Is it possible to achieve this somehow?
>>
>> No.
>
> #pragma omp simd has simdlen clause which is a hint on the preferable
> vectorization factor, but the vectorizer doesn't use it so far;
> probably it wouldn't be that hard to at least use that as the starting
> factor if the target has multiple ones if it is one of those.
> The vectorizer has some support for using wider vectorization factors
> if there are mixed width types within the same loop, so perhaps
> supporting 2x/4x/8x etc. sizes of the normally chosen width might not be
> that hard.
> What we don't have right now is support for using smaller
> vectorization factors, which might be sometimes beneficial for -O2
> vectorization of mixed width type loops.  We always use the vf derived
> from the smallest width type, say when using SSE2 and there is a char type,
> we try to use vf of 16 and if there is also int type, do operations on
> those in 4x as many instructions, while there is also an option to use
> vf of 4 and for operations on char just do something meaningful only in 1/4
> of vector elements.  The various x86 vector ISAs have instructions to
> widen or narrow for conversions.
>
> In any case, no is the right answer right now, we don't have that
> implemented.
>
>       Jakub
>


-- 
Best regards,
Denis.
