O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES

alalaw01 at gcc dot gnu.org Fri, 11 Dec 2015 07:06:13 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707


--- Comment #10 from alalaw01 at gcc dot gnu.org ---
This causes the scan-tree-dump-times 'vectorizing stmts using SLP' check to
FAIL in slp-perm-{1,2,3,5,6,7,8,11}.c. Looking at the assembler before and
after...

slp-perm-1.c: this looks like a big win; several st3's are generated instead of many
stp's, we lose all the tbl's, and many constant-pool entries consisting of
'byte's are removed, with the corresponding ADRP's. The loop is fully unrolled
in both cases, and the new code is much shorter (48 instructions rather than
95).

slp-perm-2.c: less clear, but looks like an overall win. Loop gets unrolled by
factor of 2; each "half" loses a TRN1 and a TRN2 but gains an ORR (move).

slp-perm-3.c: Again we lose a load of constants and ADRPs (outside the
4-iteration loop), gaining some MOVIs. With the patchlet, the loop gets fully
unrolled, and loses 4*tbl per iteration (!). Still executing 8*mul, 8*mla,
4*add, but dropping the TBLs again makes for a win.

slp-perm-5.c: less clear, but again looks like an overall win. Both loops have
been fully unrolled, and the combining of stores doesn't help much (we seem to
gain as many moves as we lose stores!), but with the patch we lose several
TBLs and TRNs. Also an MLA becomes a MUL.

A side comment would be that if we could 'fix' the register allocation here, to
put things into the right place ready for the stN rather than moving it there
later, we'd have quite a big win...but that's another issue.

Also a recurring theme is that the vec_(load/store)_lanes approach seems to
make much better use of movi, rather than pushing things into the constant
pool. I haven't really looked into this, it may be fundamental, or just a
limitation of our current code for loading immediates.

slp-perm-6.c: some wins from constants, and dropping 8 tbls.

slp-perm-7.c: Similarly.

slp-perm-8.c: Loop here iterates 4 times, and the ld3/st3 manages to lose us
4*move and 9*tbl per iteration (!); huge improvement.
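
For readers less familiar with the pattern, here is a minimal sketch of the
kind of stride-3 loop where this trade-off shows up. It is NOT the source of
slp-perm-8.c; the function name rotate3 and the particular permutation are
made up for illustration. With load/store lanes, an ld3 can de-interleave the
three streams into separate registers and an st3 can re-interleave them, so
the permutation itself costs nothing. With VEC_PERM_EXPR on contiguous loads,
each output vector generally needs tbl's driven by index vectors from the
constant pool. Which strategy the compiler actually picks depends on the
version and the cost model.

```c
#include <stddef.h>

/* Hypothetical example, not taken from the GCC testsuite.
   Each iteration reads one group of three adjacent bytes and
   writes the same group back rotated left by one element.  */
void rotate3(const unsigned char *in, unsigned char *out, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      out[3 * i + 0] = in[3 * i + 1];
      out[3 * i + 1] = in[3 * i + 2];
      out[3 * i + 2] = in[3 * i + 0];
    }
}
```

Compiling something like this for aarch64 at -O3 and comparing the assembler
with and without the patch should show whether ld3/st3 or tbl is used.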

slp-perm-11.c: a 16-iteration loop gets unrolled *2, and now uses an st2, but
no load_lanes, just a bunch of ldr's: 10 rather than the original 3(*2). 3 strs
become 4 stp's (+st2). Doesn't look like an improvement!

However, 7 out of 8 cases look better, in some cases much better. So I'd say
that was a definite codegen improvement :).

[Bug tree-optimization/68707] [6 Regression] testcase gcc.dg/vect/O3-pr36098.c vectorized using VEC_PERM_EXPR rather than VEC_LOAD_LANES
