https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492

--- Comment #7 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to ptomsich from comment #6)
> With the current master, the test case generates (with -mcpu=neoverse-n1):

> which contrasts with LLVM13 (with -mcpu=neoverse-n1):
> 
> test_slp:                               // @test_slp
>       .cfi_startproc
> // %bb.0:                               // %entry
>       ldr     q0, [x0]
>       movi    v1.16b, #1
>       movi    v2.2d, #0000000000000000
>       udot    v2.4s, v0.16b, v1.16b
>       addv    s0, v2.4s
>       fmov    w0, s0
>       ret
> .Lfunc_end0:
>       .size   test_slp, .Lfunc_end0-test_slp
> 
> or (LLVM13 w/o the mcpu-option):
> 
>       .type   test_slp,@function
> test_slp:                               // @test_slp
>       .cfi_startproc
> // %bb.0:                               // %entry
>       ldr     q0, [x0]
>       ushll2  v1.8h, v0.16b, #0
>       ushll   v0.8h, v0.8b, #0
>       uaddl2  v2.4s, v0.8h, v1.8h
>       uaddl   v0.4s, v0.4h, v1.4h
>       add     v0.4s, v0.4s, v2.4s
>       addv    s0, v0.4s
>       fmov    w0, s0
>       ret
> .Lfunc_end0:
>       .size   test_slp, .Lfunc_end0-test_slp

It's definitely a neat trick, but correct me if I'm wrong: it's only possible
because addition is commutative.

Clang has just simply reordered the loads because the loop is very simple to
just

    for( int i = 0; i < 4; i++, b += 4 )
    {
            tmp[i][0] = b[0];
            tmp[i][1] = b[1];
            tmp[i][2] = b[2];
            tmp[i][3] = b[3];
    }

Which GCC also handles fine. 

As Richi mentioned before

>I know the "real" code this testcase is from has actual operations
> in place of the b[N] reads, for the above vectorization looks somewhat
> pointless given we end up decomposing the result again.

It seems a bit of a too narrow focus to optimize for this particular example as
the real code does "other" things.

i.e.

Both GCC and Clang fall apart with

int test_slp( unsigned char *b )
{
    unsigned int tmp[4][4];
    int sum = 0;
    for( int i = 0; i < 4; i++, b += 4 )
    {
            tmp[i][0] = b[0] - b[4];
            tmp[i][2] = b[1] + b[5];
            tmp[i][1] = b[2] - b[6];
            tmp[i][3] = b[3] + b[7];
    }
    for( int i = 0; i < 4; i++ )
    {
            sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
    }
    return sum;
}

which has about the same access pattern as the real code.

If you change the operations you'll notice that for others examples like

int test_slp( unsigned char *b )
{
    unsigned int tmp[4][4];
    int sum = 0;
    for( int i = 0; i < 4; i++, b += 4 )
    {
            tmp[i][0] = b[0] - b[4];
            tmp[i][2] = b[1] - b[5];
            tmp[i][1] = b[2] - b[6];
            tmp[i][3] = b[3] - b[7];
    }
    for( int i = 0; i < 4; i++ )
    {
            sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
    }
    return sum;
}

GCC handles this better (but we are let down by register allocation).

To me it seems quite unlikely that actual code would be written like that, but
I guess there could be a case to be made to try to reassoc loads as well.

Reply via email to