On 30 November 2011 02:33, Michael Hope <michael.h...@linaro.org> wrote:

> I then converted the vld1 and vst1 to specify an alignment of 64
> bits. See:
>  http://people.linaro.org/~michaelh/incoming/set-alignment.png
>
> This improved the throughput in all cases, and by 14 % in cases of more
> than 50 words.  This graph also shows the overhead of the runtime
> peeling check.  The blue line is the vectoriser version, which is
> slower to pick up due to the greater per-call overhead.

So, the auto-vectorized code doesn't have the alignment hints (with or
without peeling), right? Is this what a hint is supposed to look like:
vld1.i64 {d16-d17}, [r1 :"#_128"] , or am I looking for the wrong thing?

I thought that peeling should be useful at least for the hints.
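
To make concrete what I mean by peeling for alignment, here is a minimal
sketch in plain C. It is not the vectoriser's actual output; add_words and
peel_count are hypothetical names, the 16-byte boundary is an assumption
matching a 128-bit hint, and __builtin_assume_aligned is a GCC/Clang
builtin that may not exist in older compilers. Only dst is peeled; src
stays of unknown alignment.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: how many leading elements must be handled one at a
   time before p reaches a 16-byte boundary (capped at n). Assumes p is at
   least 4-byte aligned, as a uint32_t pointer normally is. */
static size_t peel_count(const uint32_t *p, size_t n)
{
    size_t misalign = (size_t)((uintptr_t)p & 15u);   /* bytes past boundary */
    size_t peel = misalign ? (16u - misalign) / sizeof *p : 0;
    return peel < n ? peel : n;
}

/* Hypothetical example loop: dst[i] += src[i] for n words. */
void add_words(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i = 0;
    size_t peel = peel_count(dst, n);

    /* Peeled prologue: scalar iterations until dst + i is 16-byte aligned. */
    for (; i < peel; i++)
        dst[i] += src[i];

    /* Main loop: four words per iteration. dst + i is 16-byte aligned on
       every iteration, which is what would justify an alignment hint on
       the store side. */
    for (; i + 4 <= n; i += 4) {
        uint32_t *d = __builtin_assume_aligned(dst + i, 16);
        d[0] += src[i];
        d[1] += src[i + 1];
        d[2] += src[i + 2];
        d[3] += src[i + 3];
    }

    /* Scalar epilogue for the remaining 0..3 words. */
    for (; i < n; i++)
        dst[i] += src[i];
}
```

Note that the peel count is only known at run time here, so the main
loop's trip count is too, even when n is a compile-time constant.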

>
> I then went back to the vectoriser and changed the alignment of the
> struct to cause peeling to turn on and off.  See:
>  http://people.linaro.org/~michaelh/incoming/unroll.png
>
> At 200 words, the version without peeling is 2.9 % faster.  This is
> partly due to a fixed count loop turning into a runtime count due to
> unknown alignment.
>
> This run also showed the effect of loop unrolling.  The loop seems to
> be unrolled for loops of <= 64 words and drops off in performance past
> around 8 words.  When the unrolling finally drops out, performance
> increases by 101 %.

I see register spills starting from COUNT=36.

Ira
