https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492
--- Comment #7 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to ptomsich from comment #6)
> With the current master, the test case generates (with -mcpu=neoverse-n1):
> which contrasts with LLVM13 (with -mcpu=neoverse-n1):
>
> test_slp:                               // @test_slp
>         .cfi_startproc
> // %bb.0:                               // %entry
>         ldr     q0, [x0]
>         movi    v1.16b, #1
>         movi    v2.2d, #0000000000000000
>         udot    v2.4s, v0.16b, v1.16b
>         addv    s0, v2.4s
>         fmov    w0, s0
>         ret
> .Lfunc_end0:
>         .size   test_slp, .Lfunc_end0-test_slp
>
> or (LLVM13 w/o the mcpu-option):
>
>         .type   test_slp,@function
> test_slp:                               // @test_slp
>         .cfi_startproc
> // %bb.0:                               // %entry
>         ldr     q0, [x0]
>         ushll2  v1.8h, v0.16b, #0
>         ushll   v0.8h, v0.8b, #0
>         uaddl2  v2.4s, v0.8h, v1.8h
>         uaddl   v0.4s, v0.4h, v1.4h
>         add     v0.4s, v0.4s, v2.4s
>         addv    s0, v0.4s
>         fmov    w0, s0
>         ret
> .Lfunc_end0:
>         .size   test_slp, .Lfunc_end0-test_slp

It's definitely a neat trick, but correct me if I'm wrong: it's only possible
because addition is commutative. Clang has simply reordered the loads, because
the loop is simple enough to reduce to just

  for( int i = 0; i < 4; i++, b += 4 )
  {
      tmp[i][0] = b[0];
      tmp[i][1] = b[1];
      tmp[i][2] = b[2];
      tmp[i][3] = b[3];
  }

which GCC also handles fine.

As Richi mentioned before

> I know the "real" code this testcase is from has actual operations
> in place of the b[N] reads, for the above vectorization looks somewhat
> pointless given we end up decomposing the result again.

It seems too narrow a focus to optimize for this particular example, as the
real code does "other" things, i.e. both GCC and Clang fall apart with

  int test_slp( unsigned char *b )
  {
      unsigned int tmp[4][4];
      int sum = 0;
      for( int i = 0; i < 4; i++, b += 4 )
      {
          tmp[i][0] = b[0] - b[4];
          tmp[i][2] = b[1] + b[5];
          tmp[i][1] = b[2] - b[6];
          tmp[i][3] = b[3] + b[7];
      }
      for( int i = 0; i < 4; i++ )
      {
          sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
      }
      return sum;
  }

which has about the same access pattern as the real code.

If you change the operations, you'll notice that for other examples, like

  int test_slp( unsigned char *b )
  {
      unsigned int tmp[4][4];
      int sum = 0;
      for( int i = 0; i < 4; i++, b += 4 )
      {
          tmp[i][0] = b[0] - b[4];
          tmp[i][2] = b[1] - b[5];
          tmp[i][1] = b[2] - b[6];
          tmp[i][3] = b[3] - b[7];
      }
      for( int i = 0; i < 4; i++ )
      {
          sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
      }
      return sum;
  }

GCC handles this better (but we are let down by register allocation).

To me it seems quite unlikely that actual code would be written like that, but
I guess there could be a case to be made to try to reassoc loads as well.
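
To make the "only possible because addition is commutative" point concrete, here is a minimal sketch, not taken from the PR. It assumes the load-only test case is the copy loop quoted above followed by the same column-sum loop used in the variants. The names test_flat and check_reorder are hypothetical, and const has been added to the pointer for convenience. The flat version sums the 16 bytes in memory order, which is effectively what the udot sequence computes.

```c
#include <assert.h>

/* Load-only test case, assembled from the loops quoted in this comment
   (assumption: the second loop matches the one in the variants). */
int test_slp(const unsigned char *b)
{
    unsigned int tmp[4][4];
    int sum = 0;
    for (int i = 0; i < 4; i++, b += 4) {
        tmp[i][0] = b[0];
        tmp[i][1] = b[1];
        tmp[i][2] = b[2];
        tmp[i][3] = b[3];
    }
    /* Column-wise reduction: visits tmp in transposed order. */
    for (int i = 0; i < 4; i++)
        sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
    return sum;
}

/* Hypothetical flat form: the same 16 bytes summed in memory order.
   Legal only because every element feeds one commutative, associative
   addition, so the order of the loads does not matter. */
int test_flat(const unsigned char *b)
{
    unsigned int sum = 0;
    for (int i = 0; i < 16; i++)
        sum += b[i];
    return (int)sum;
}

/* Hypothetical helper: asserts that both forms agree on one input. */
void check_reorder(const unsigned char *b)
{
    assert(test_slp(b) == test_flat(b));
}
```

Once the stores are mixed subtractions and additions of different elements, as in the variants above, the body is no longer a plain byte sum, so this particular rewrite no longer applies directly.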