Re: SSE (Pentium 3) - Is this correct?

Dorit Nuzman Mon, 08 Jan 2007 00:01:59 -0800

> Even if you fix that, gcc will only vectorize if you pass the
> -ftree-vectorize option.  And it will only vectorize code in loops.


Supporting straight-line code vectorization is in the works, but at first
we'll look for such opportunities only in loops (i.e. exploit
vector-parallelism within an iteration rather than only across iterations).
So we'll be able to vectorize unrolled loops and such, but not the code
example in question, if it's not enclosed in any loop.
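
To make the distinction concrete, here is a minimal sketch (my own illustration, not taken from the vectorizer sources or testsuite; the function names add_unrolled and add4_straight are made up). The first function has the shape that within-iteration vectorization would look at: four independent, adjacent statements inside a loop body. The second has the same four statements with no enclosing loop, which is the case that would not be handled at first.

```c
#include <stddef.h>

/* Hand-unrolled loop: four independent statements on adjacent elements
   per iteration. Hypothetical example of the "unrolled loops and such"
   case, where the parallelism is inside one iteration. */
void add_unrolled(float *restrict a, const float *restrict b, size_t n)
{
    size_t i;

    for (i = 0; i + 4 <= n; i += 4) {
        a[i + 0] += b[i + 0];
        a[i + 1] += b[i + 1];
        a[i + 2] += b[i + 2];
        a[i + 3] += b[i + 3];
    }
    for (; i < n; ++i)          /* leftover elements when n % 4 != 0 */
        a[i] += b[i];
}

/* The same four statements with no enclosing loop: straight-line code
   that the initial support described above would not consider. */
void add4_straight(float *restrict a, const float *restrict b)
{
    a[0] += b[0];
    a[1] += b[1];
    a[2] += b[2];
    a[3] += b[3];
}
```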

> And it unfortunately doesn't do a good job of using movups, so it will
> mess around with checking the alignment.

Yes, the way we handle unaligned stores is to peel the loop to make that
access aligned. We do use the movups for the load though. This is just a
random restriction that can easily be fixed - I have an old patch for
misaligned stores sitting around - I'll just go ahead and send it.
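
Roughly, the peeling scheme can be pictured with the scalar model below. This is only a sketch under my own assumptions (a vectorization factor of 4 floats, 16-byte vectors, and a made-up function name add_peeled); it is not the code gcc actually emits, and the inner loop merely stands in for a single vector operation.

```c
#include <stddef.h>
#include <stdint.h>

enum { VF = 4, VEC_BYTES = 16 };   /* assumed: 4 floats per 16-byte vector */

/* Scalar model of a loop peeled for store alignment. */
void add_peeled(float *restrict va, const float *restrict vb, size_t n)
{
    size_t i = 0;
    int k;

    /* Peel-loop before the vector-loop: run scalar iterations until the
       store address &va[i] is 16-byte aligned (or we run out of work). */
    while (i < n && ((uintptr_t)(va + i) % VEC_BYTES) != 0) {
        va[i] += vb[i];
        ++i;
    }

    /* Stand-in for the vector-loop: VF elements per iteration. The store
       to va is aligned here; the load from vb may still be unaligned,
       which is where an unaligned load such as movups would be used. */
    for (; i + VF <= n; i += VF) {
        for (k = 0; k < VF; ++k)
            va[i + k] += vb[i + k];
    }

    /* Peel-loop after the vector-loop: the remaining iterations. */
    for (; i < n; ++i)
        va[i] += vb[i];
}
```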

> And there isn't a good way
> to specify alignment.
>
> I do see use of the vector instructions for this example
>
>
> float *vector_add4f(float * __restrict va, float * __restrict vb)
> {
>   int i;
>
>   for (i = 0; i < 4; ++i)
>     va[i] += vb[i];
>   return va;
> }
>
> if I compile with -O2 -ftree-vectorize.  Frankly the generated code is
> really awful, and I wouldn't be surprised if it runs more slowly than
> the non-vectorized code.

If the va access is unaligned the vectorized code will definitely be slower
than the original code, because the vector code will not be executed - just
the peel-loop before the vector-loop (that tries to align the store) and
the peel-loop after the vector-loop (for the remaining iterations). So we
will just have executed extra ifs/branches and the code size will have
increased.  Probably even if the store is aligned it will not be much of a
win for such a small trip count. I have a small patch that lets you specify
the minimum number of vector iterations under which you don't want to allow
vectorization. I'll go ahead and send that too.
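
The arithmetic behind this can be sketched as follows, assuming a vectorization factor of 4 and expressing the misalignment of the store as a number of elements past a 16-byte boundary. The names split_iterations, worth_vectorizing and min_vector_iters are hypothetical and only illustrate the idea of a minimum-iterations threshold; they say nothing about how the patch itself is written or how the threshold is exposed.

```c
#include <stddef.h>

/* How the original trip count is divided among the three loops. */
struct split {
    size_t prologue;      /* scalar iterations that align the store   */
    size_t vector_iters;  /* iterations of the vector-loop            */
    size_t epilogue;      /* scalar iterations left over afterwards   */
};

/* n: trip count of the original loop.
   misalign_elems: elements past a 16-byte boundary at which the store
   starts (0 means already aligned). Assumes VF = 4. */
struct split split_iterations(size_t n, size_t misalign_elems)
{
    const size_t vf = 4;
    struct split s;
    size_t peel = (vf - misalign_elems % vf) % vf;

    s.prologue = peel < n ? peel : n;
    s.vector_iters = (n - s.prologue) / vf;
    s.epilogue = (n - s.prologue) % vf;
    return s;
}

/* Hypothetical guard: vectorize only if the vector-loop would run at
   least min_vector_iters times. */
int worth_vectorizing(struct split s, size_t min_vector_iters)
{
    return s.vector_iters >= min_vector_iters;
}
```

For the example above (n = 4), a store that starts one element past a boundary gives 3 prologue iterations, 0 vector iterations and 1 epilogue iteration, so only the scalar peel-loops run. With an aligned store the same loop gives exactly 1 vector iteration.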

> This is evidently an area where the compiler
> could use more work.
>

True, and indeed there are people who are currently looking into adding a
cost model to the vectorizer. The patch I mentioned above would be a first
(trivial) step towards that.

dorit

> Ian
