Bingfeng Mei wrote:
Hi,
I created PR59544 and here is the patch. OK to commit?
Thanks,
Bingfeng


2013-12-18  Bingfeng Mei  <b...@broadcom.com>

        PR tree-optimization/59544
        * tree-vect-stmts.c (perm_mask_for_reverse): Move before
        vectorizable_store.
        (vectorizable_store): Handle negative step.

2013-12-18  Bingfeng Mei  <b...@broadcom.com>

        PR tree-optimization/59544
        * gcc.target/i386/pr59544.c: New test.
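
For reference, a rough sketch of the shape such a test could take; the loop mirrors the example further down in this thread, and the dg directives and the dump string scanned for are assumptions rather than the committed contents of pr59544.c:

/* { dg-do compile } */
/* { dg-options "-O2 -ftree-vectorize -mavx -fdump-tree-vect-details" } */

void
test1 (short *__restrict__ x, short *__restrict__ y, short *__restrict__ z)
{
  int i;
  for (i = 127; i >= 0; i--)
    x[i] = y[127 - i] + z[127 - i];
}

/* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" } } */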


Hi Bingfeng,

Your patch seems to have a dependence calculation bug (I think), due to which gcc.dg/torture/pr52943.c regresses on aarch64. I've raised http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59651.

Do you think you could have a look?

Thanks,
Tejas.

-----Original Message-----
From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On 
Behalf Of Richard Biener
Sent: 18 December 2013 11:47
To: Bingfeng Mei
Cc: gcc-patches@gcc.gnu.org
Subject: Re: Vectorization for store with negative step

On Wed, Dec 18, 2013 at 12:34 PM, Bingfeng Mei <b...@broadcom.com> wrote:
Thanks, Richard. I will file a bug report and prepare a complete patch. For the 
perm_mask_for_reverse function, should I move it before vectorizable_store or 
add a declaration?

Move it.

Richard.

Bingfeng
-----Original Message-----
From: Richard Biener [mailto:richard.guent...@gmail.com]
Sent: 18 December 2013 11:26
To: Bingfeng Mei
Cc: gcc-patches@gcc.gnu.org
Subject: Re: Vectorization for store with negative step

On Mon, Dec 16, 2013 at 5:54 PM, Bingfeng Mei <b...@broadcom.com> wrote:
Hi,
I was looking at some loops that can be vectorized by LLVM, but not by GCC. One 
such type of loop is a store with a negative step.

void test1(short * __restrict__ x, short * __restrict__ y, short * __restrict__ 
z)
{
    int i;
    for (i=127; i>=0; i--) {
        x[i] = y[127-i] + z[127-i];
    }
}

I don't know why GCC implements negative step only for loads, not for stores. I 
implemented a patch, very similar to the code in vectorizable_load.
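
Conceptually, the vectorized negative-step store amounts to something like the following hand-written SSE version (a sketch only, not the patch itself; the test1_by_hand name and the use of the SSSE3 byte shuffle are purely for illustration):

#include <tmmintrin.h>  /* SSSE3 _mm_shuffle_epi8; compile with -mssse3 or -mavx.  */

void
test1_by_hand (short *__restrict__ x, short *__restrict__ y,
               short *__restrict__ z)
{
  /* Byte shuffle mask that reverses the order of the eight 16-bit lanes.  */
  const __m128i rev = _mm_set_epi8 (1, 0, 3, 2, 5, 4, 7, 6,
                                    9, 8, 11, 10, 13, 12, 15, 14);
  int i;
  for (i = 0; i < 128; i += 8)
    {
      /* Forward loads and add, as in the scalar loop rewritten with j = 127 - i.  */
      __m128i v = _mm_add_epi16 (_mm_loadu_si128 ((const __m128i *) (y + i)),
                                 _mm_loadu_si128 ((const __m128i *) (z + i)));
      v = _mm_shuffle_epi8 (v, rev);                    /* reverse the lanes */
      _mm_storeu_si128 ((__m128i *) (x + 120 - i), v);  /* store walks backwards */
    }
}

This corresponds to the vpshufb and the backwards-walking vmovups store in the patched assembly below.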

~/scratch/install-x86/bin/gcc ghs-dec.c -ftree-vectorize -S -O2 -mavx

Without patch:
test1:
.LFB0:
        addq    $254, %rdi
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L2:
        movzwl  (%rsi,%rax), %ecx
        subq    $2, %rdi
        addw    (%rdx,%rax), %cx
        addq    $2, %rax
        movw    %cx, 2(%rdi)
        cmpq    $256, %rax
        jne     .L2
        rep; ret

With patch:
test1:
.LFB0:
        vmovdqa .LC0(%rip), %xmm1
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L2:
        vmovdqu (%rsi,%rax), %xmm0
        movq    %rax, %rcx
        negq    %rcx
        vpaddw  (%rdx,%rax), %xmm0, %xmm0
        vpshufb %xmm1, %xmm0, %xmm0
        addq    $16, %rax
        cmpq    $256, %rax
        vmovups %xmm0, 240(%rdi,%rcx)
        jne     .L2
        rep; ret

Performance is clearly improved here. The patch bootstraps on 
x86_64-unknown-linux-gnu and shows no additional regressions on my machine.

For reference, LLVM seems to use different instructions and produces slightly 
worse code. I am not very familiar with x86 assembly. The patch was originally 
written for our private port.
test1:                                  # @test1
        .cfi_startproc
# BB#0:                                 # %entry
        addq    $240, %rdi
        xorl    %eax, %eax
        .align  16, 0x90
.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        movdqu  (%rsi,%rax,2), %xmm0
        movdqu  (%rdx,%rax,2), %xmm1
        paddw   %xmm0, %xmm1
        shufpd  $1, %xmm1, %xmm1        # xmm1 = xmm1[1,0]
        pshuflw $27, %xmm1, %xmm0       # xmm0 = xmm1[3,2,1,0,4,5,6,7]
        pshufhw $27, %xmm0, %xmm0       # xmm0 = xmm0[0,1,2,3,7,6,5,4]
        movdqu  %xmm0, (%rdi)
        addq    $8, %rax
        addq    $-16, %rdi
        cmpq    $128, %rax
        jne     .LBB0_1
# BB#2:                                 # %for.end
        ret

Any comment?

Looks good to me.  One of the various TODOs in vectorizable_store I presume.

Needs a testcase and at this stage a bugreport that is fixed by it.

Thanks,
Richard.

Bingfeng Mei
Broadcom UK
