Re: Effect of alignment and peeling on vectorised loops
On 30 November 2011 02:33, Michael Hope wrote:

> I then converted the vld1 and vst1 to specify an alignment of 64
> bits. See:
> http://people.linaro.org/~michaelh/incoming/set-alignment.png
>
> This improved the throughput in all cases and in cases for more than 50
> words by 14 %. This graph also shows the overhead of the runtime
> peeling check. The blue line is the vectoriser version which is
> slower to pick up due to the greater per call overhead.

So, the auto-vectorized code doesn't have the alignment hints (peeling
or not peeling), right? Is this what a hint is supposed to look like:
vld1.i64 {d16-d17}, [r1 :"#_128"] , or am I looking for the wrong thing?

I thought that peeling should be useful at least for the hints.

> I then went back to the vectoriser and changed the alignment of the
> struct to cause peeling to turn on and off. See:
> http://people.linaro.org/~michaelh/incoming/unroll.png
>
> At 200 words, the version without peeling is 2.9 % faster. This is
> partly due to a fixed count loop turning into a runtime count due to
> unknown alignment.
>
> This run also showed the effect of loop unrolling. The loop seems to
> be unrolled for loops of <= 64 words and drops off in performance past
> around 8 words. When the unrolling finally drops out, performance
> increases by 101 %.

I see register spills starting from COUNT=36.

Ira

___
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain
Re: gcc4.6,how to remove werror
On Tue, Nov 29, 2011 at 09:43:01PM +0700, tknv wrote:
> This issue happened when I was compiling an ARM kernel.
> I could not find where -Werror gets overridden in tools/perf.
> Could you tell me where it is?

In tools/perf/Makefile:

34:# Define WERROR=0 to disable treating any warnings as errors.

69:ifneq ($(WERROR),0)
70:	CFLAGS_WERROR := -Werror

105:CFLAGS = -fno-omit-frame-pointer -ggdb3 -Wall -Wextra -std=gnu99 $(CFLAGS_WERROR) $(CFLAGS_OPTIMIZE) -D_FORTIFY_SOURCE=2 $(EXTRA_WARNINGS) $(EXTRA_CFLAGS)

This Makefile _only_ applies to the perf tools. The rest of the kernel
uses other Makefiles.

The perf tools are not built automatically, so if you are only trying to
build a kernel image then you must have some other problem.

As I suggested previously, can you please run make V=1 and send the last
few lines of output, including the error message *and* the command which
caused it.

Without a precise indication of where the error occurs, I can't offer
much help.

Cheers
---Dave
Re: Effect of alignment and peeling on vectorised loops
On Thu, Dec 1, 2011 at 12:20 AM, Ira Rosen wrote:
> On 30 November 2011 02:33, Michael Hope wrote:
>
>> I then converted the vld1 and vst1 to specify an alignment of 64
>> bits. See:
>> http://people.linaro.org/~michaelh/incoming/set-alignment.png
>>
>> This improved the throughput in all cases and in cases for more than 50
>> words by 14 %. This graph also shows the overhead of the runtime
>> peeling check. The blue line is the vectoriser version which is
>> slower to pick up due to the greater per call overhead.
>
> So, the auto-vectorized code doesn't have the alignment hints (peeling
> or not peeling), right? Is this what a hint is supposed to look like:
> vld1.i64 {d16-d17}, [r1 :"#_128"] , or am I looking for the wrong thing?

Yip. We currently use a

    vldmia r1!, {d16-d17}

which (on the A9 at least) only works for aligned values and takes the
same time as the unaligned-friendly

    vld1.i64 {d16-d17}, [r1]!

> I thought that peeling should be useful at least for the hints.

Peeling and using the

    vld1.i64 {d16-d17}, [r1:64]!

form should be faster for larger loops. For some reason
vld1.i64 ..., [r1:128] gives an illegal instruction trap on my board.
Note that the :128 is in bits.

>> I then went back to the vectoriser and changed the alignment of the
>> struct to cause peeling to turn on and off. See:
>> http://people.linaro.org/~michaelh/incoming/unroll.png
>>
>> At 200 words, the version without peeling is 2.9 % faster. This is
>> partly due to a fixed count loop turning into a runtime count due to
>> unknown alignment.
>>
>> This run also showed the effect of loop unrolling. The loop seems to
>> be unrolled for loops of <= 64 words and drops off in performance past
>> around 8 words. When the unrolling finally drops out, performance
>> increases by 101 %.
>
> I see register spills starting from COUNT=36.

Ah. Does the vectoriser cost model take register pressure into account?
How can I turn this on?
-- Michael
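
To make the peeling discussed in this message concrete, here is a minimal C sketch of the idea, assuming a simple "add a constant to every word" loop. It is not what the GCC vectoriser emits; the names peel_count and add_const_peeled are hypothetical, and the four-words-per-iteration body only stands in for a single 128-bit vector operation. It shows the runtime alignment check and why the main loop's trip count is no longer a compile-time constant once peeling is involved.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from GCC): how many leading 32-bit words must be
   handled one at a time before p reaches an align_bytes boundary.
   Assumes align_bytes is a power of two and p is at least 4-byte aligned. */
size_t peel_count(const uint32_t *p, size_t align_bytes, size_t n)
{
    size_t misalign = (size_t)((uintptr_t)p & (align_bytes - 1));
    size_t peel = misalign ? (align_bytes - misalign) / sizeof(uint32_t) : 0;
    return peel < n ? peel : n;   /* never peel more than the whole array */
}

/* Add k to n words in place: peeled prologue, aligned main loop, tail. */
void add_const_peeled(uint32_t *dst, size_t n, uint32_t k)
{
    size_t i = 0;
    size_t peel = peel_count(dst, 8, n);   /* runtime check: 64-bit boundary */

    for (; i < peel; i++)                  /* scalar prologue */
        dst[i] += k;

    /* If any words remain, dst + i is now 8-byte aligned. Four words per
       iteration stands in for one 128-bit vector operation. The trip count
       (n - peel) / 4 is only known at run time. */
    for (; i + 4 <= n; i += 4) {
        dst[i + 0] += k;
        dst[i + 1] += k;
        dst[i + 2] += k;
        dst[i + 3] += k;
    }

    for (; i < n; i++)                     /* scalar tail */
        dst[i] += k;
}
```

Calling add_const_peeled(buf + 1, 10, 5) on a 4-byte-aligned buffer exercises whichever of the two paths (one word peeled or none) the address happens to select, which is the per-call overhead the graphs measure.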
Re: Effect of alignment and peeling on vectorised loops
On Thu, Dec 1, 2011 at 12:20 AM, Ira Rosen wrote:
> On 30 November 2011 02:33, Michael Hope wrote:
>
>> I then converted the vld1 and vst1 to specify an alignment of 64
>> bits. See:
>> http://people.linaro.org/~michaelh/incoming/set-alignment.png
>>
>> This improved the throughput in all cases and in cases for more than 50
>> words by 14 %. This graph also shows the overhead of the runtime
>> peeling check. The blue line is the vectoriser version which is
>> slower to pick up due to the greater per call overhead.
>
> So, the auto-vectorized code doesn't have the alignment hints (peeling
> or not peeling), right? Is this what a hint is supposed to look like:
> vld1.i64 {d16-d17}, [r1 :"#_128"] , or am I looking for the wrong thing?

I had a look in the backend and the vld1/vst1 %A operand adds the
alignment if known. It correctly adds [r1:64] if I feed in an array of
int64s. The code checks based on MEM_ALIGN and MEM_SIZE of the operand:

    align = MEM_ALIGN (x) >> 3;
    memsize = INTVAL (MEM_SIZE (x));

Not sure why the backend generates a vldmia instead of a vld1 though.

-- Michael
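
As a rough illustration of the check described above, the following C sketch converts a known alignment from bits to bytes (the `>> 3` in the quoted line) and then decides whether to print a `:64` or `:128` qualifier. The selection rule here is an assumption made for illustration, not the backend's actual logic, and the function names pick_align_hint and format_address are hypothetical. Whether a given core accepts a particular qualifier is a separate question, as the :128 trap reported earlier in the thread shows.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical: choose the alignment qualifier in bits, or 0 for none.
   mem_align_bits plays the role of MEM_ALIGN (x), memsize_bytes that of
   MEM_SIZE (x). The thresholds below are assumed, not taken from GCC. */
unsigned pick_align_hint(unsigned mem_align_bits, unsigned memsize_bytes)
{
    unsigned align = mem_align_bits >> 3;   /* bits -> bytes */

    if (align == 0)
        return 0;                           /* alignment unknown */
    if (memsize_bytes >= 16 && align % 16 == 0)
        return 128;
    if (memsize_bytes >= 8 && align % 8 == 0)
        return 64;
    return 0;
}

/* Hypothetical: print the address operand, e.g. "[r1:64]" or "[r1]". */
int format_address(char *buf, size_t len, const char *reg, unsigned hint_bits)
{
    if (hint_bits != 0)
        return snprintf(buf, len, "[%s:%u]", reg, hint_bits);
    return snprintf(buf, len, "[%s]", reg);
}
```

Under these assumptions a 16-byte access to memory known to be 64-bit aligned, such as an array of int64s, yields the [r1:64] form mentioned above.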