Thanks, Richard. I will file a bug report and prepare a complete patch.
For the perm_mask_for_reverse function, should I move it before
vectorizable_store or add a forward declaration?

Bingfeng

-----Original Message-----
From: Richard Biener [mailto:richard.guent...@gmail.com]
Sent: 18 December 2013 11:26
To: Bingfeng Mei
Cc: gcc-patches@gcc.gnu.org
Subject: Re: Vectorization for store with negative step

On Mon, Dec 16, 2013 at 5:54 PM, Bingfeng Mei <b...@broadcom.com> wrote:
> Hi,
> I was looking at some loops that can be vectorized by LLVM, but not GCC. One
> type of loop is with store of negative step.
>
> void test1(short * __restrict__ x, short * __restrict__ y, short *
> __restrict__ z)
> {
>     int i;
>     for (i=127; i>=0; i--) {
>         x[i] = y[127-i] + z[127-i];
>     }
> }
>
> I don't know why GCC only implements negative step for load, but not store. I
> implemented a patch, very similar to code in vectorizable_load.
>
> ~/scratch/install-x86/bin/gcc ghs-dec.c -ftree-vectorize -S -O2 -mavx
>
> Without patch:
> test1:
> .LFB0:
>         addq    $254, %rdi
>         xorl    %eax, %eax
>         .p2align 4,,10
>         .p2align 3
> .L2:
>         movzwl  (%rsi,%rax), %ecx
>         subq    $2, %rdi
>         addw    (%rdx,%rax), %cx
>         addq    $2, %rax
>         movw    %cx, 2(%rdi)
>         cmpq    $256, %rax
>         jne     .L2
>         rep; ret
>
> With patch:
> test1:
> .LFB0:
>         vmovdqa .LC0(%rip), %xmm1
>         xorl    %eax, %eax
>         .p2align 4,,10
>         .p2align 3
> .L2:
>         vmovdqu (%rsi,%rax), %xmm0
>         movq    %rax, %rcx
>         negq    %rcx
>         vpaddw  (%rdx,%rax), %xmm0, %xmm0
>         vpshufb %xmm1, %xmm0, %xmm0
>         addq    $16, %rax
>         cmpq    $256, %rax
>         vmovups %xmm0, 240(%rdi,%rcx)
>         jne     .L2
>         rep; ret
>
> Performance is definitely improved here. It is bootstrapped for
> x86_64-unknown-linux-gnu, and has no additional regressions on my machine.
>
> For reference, LLVM seems to use different instructions and slightly worse
> code. I am not so familiar with x86 assembly code. The patch is originally
> for our private port.
>
> test1:                                  # @test1
>         .cfi_startproc
> # BB#0:                                 # %entry
>         addq    $240, %rdi
>         xorl    %eax, %eax
>         .align  16, 0x90
> .LBB0_1:                                # %vector.body
>                                         # =>This Inner Loop Header: Depth=1
>         movdqu  (%rsi,%rax,2), %xmm0
>         movdqu  (%rdx,%rax,2), %xmm1
>         paddw   %xmm0, %xmm1
>         shufpd  $1, %xmm1, %xmm1        # xmm1 = xmm1[1,0]
>         pshuflw $27, %xmm1, %xmm0       # xmm0 = xmm1[3,2,1,0,4,5,6,7]
>         pshufhw $27, %xmm0, %xmm0       # xmm0 = xmm0[0,1,2,3,7,6,5,4]
>         movdqu  %xmm0, (%rdi)
>         addq    $8, %rax
>         addq    $-16, %rdi
>         cmpq    $128, %rax
>         jne     .LBB0_1
> # BB#2:                                 # %for.end
>         ret
>
> Any comment?

Looks good to me.  One of the various TODOs in vectorizable_store I presume.

Needs a testcase and at this stage a bugreport that is fixed by it.

Thanks,
Richard.

> Bingfeng Mei
> Broadcom UK
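
For readers following the thread, the effect of vectorizing a negative-step
store on test1 can be sketched in plain C. This is only an illustration and
not code from the patch: the names test1_scalar, test1_blocked and
reverse_lanes, and the constants N and VF, are made up for this sketch, and
VF = 8 is an assumption based on the 128-bit xmm registers in the "With patch"
listing. Each block of VF sums is computed in forward order, its lanes are
reversed (the role vpshufb appears to play in that listing), and the block is
stored at a descending address.

```c
#include <assert.h>

#define N  128  /* trip count of the loop in test1 */
#define VF 8    /* assumed vectorization factor: 8 shorts per 128-bit vector */

/* Scalar reference: the loop from the original message. */
void test1_scalar(short *__restrict__ x, const short *__restrict__ y,
                  const short *__restrict__ z)
{
    int i;
    for (i = N - 1; i >= 0; i--)
        x[i] = y[N - 1 - i] + z[N - 1 - i];
}

/* Reverse the VF lanes of v in place, standing in for the permute
   that a reverse mask would select (hypothetical helper). */
static void reverse_lanes(short v[VF])
{
    int k;
    for (k = 0; k < VF / 2; k++) {
        short t = v[k];
        v[k] = v[VF - 1 - k];
        v[VF - 1 - k] = t;
    }
}

/* Blocked form: forward loads, lane reversal, descending block store. */
void test1_blocked(short *__restrict__ x, const short *__restrict__ y,
                   const short *__restrict__ z)
{
    int j, k;
    assert(N % VF == 0);  /* this sketch has no epilogue loop */
    for (j = 0; j < N; j += VF) {
        short v[VF];
        /* "Vector" load and add over source indices j .. j+VF-1. */
        for (k = 0; k < VF; k++)
            v[k] = (short)(y[j + k] + z[j + k]);
        reverse_lanes(v);
        /* Lane k now holds the sum for source index j + VF - 1 - k,
           whose destination is x[N - 1 - (j + VF - 1 - k)],
           i.e. x[N - VF - j + k]. */
        for (k = 0; k < VF; k++)
            x[N - VF - j + k] = v[k];
    }
}
```

With N = 128 and VF = 8 the first block lands at x[120], which is consistent
with the 240-byte displacement in "vmovups %xmm0, 240(%rdi,%rcx)" above.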