On 30 November 2011 02:33, Michael Hope <michael.h...@linaro.org> wrote:
> I then converted the vld1 and vst1 to specify an alignment of 64
> bits.  See:
>  http://people.linaro.org/~michaelh/incoming/set-alignment.png
>
> This improved the throughput in all cases, and by 14 % in cases of
> more than 50 words.  This graph also shows the overhead of the runtime
> peeling check.  The blue line is the vectoriser version, which is
> slower to pick up due to the greater per-call overhead.

So, the auto-vectorized code doesn't have the alignment hints (with or
without peeling), right? Is this what a hint is supposed to look like:

  vld1.i64 {d16-d17}, [r1 :"#_128"]

or am I looking for the wrong thing? I thought that peeling should be
useful at least for the hints.

>
> I then went back to the vectoriser and changed the alignment of the
> struct to cause peeling to turn on and off.  See:
>  http://people.linaro.org/~michaelh/incoming/unroll.png
>
> At 200 words, the version without peeling is 2.9 % faster.  This is
> partly due to a fixed count loop turning into a runtime count due to
> unknown alignment.
>
> This run also showed the effect of loop unrolling.  The loop seems to
> be unrolled for loops of <= 64 words and drops off in performance past
> around 8 words.  When the unrolling finally drops out, performance
> increases by 101 %.

I see register spills starting from COUNT=36.

Ira