https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707
--- Comment #10 from alalaw01 at gcc dot gnu.org ---

This causes the scan-tree-dump-times 'vectorizing stmts using SLP' checks in slp-perm-{1,2,3,5,6,7,8,11}.c to FAIL. Looking at the assembler before and after...

slp-perm-1.c: this looks like a big win; several st3's are generated instead of many stp's, we lose all the tbl's, and many constant-pool entries consisting of 'byte's are removed, along with the corresponding ADRPs. The loop is fully unrolled in both cases, and the new code is much shorter (48 instructions rather than 95).

slp-perm-2.c: less clear, but looks like an overall win. The loop gets unrolled by a factor of 2; each "half" loses a TRN1 and a TRN2 but gains an ORR (move).

slp-perm-3.c: Again we lose a load of constants and ADRPs (outside the 4-iteration loop), gaining some MOVIs. With the patchlet, the loop gets fully unrolled, and loses 4*tbl per iteration (!). We are still executing 8*mul, 8*mla, 4*add, but dropping the TBLs again makes for a win.

slp-perm-5.c: less clear, but again looks like an overall win. Both loops have been fully unrolled, and the combining of stores doesn't help much (we seem to gain as many moves as we lose stores!). But with the patch, we lose several TBLs and TRNs. Also an MLA becomes a MUL. A side comment would be that if we could 'fix' the register allocation here, to put things into the right place ready for the stN rather than moving them there later, we'd have quite a big win... but that's another issue.

Also, a recurring theme is that the vec_(load/store)_lanes approach seems to make much better use of movi, rather than pushing things into the constant pool. I haven't really looked into this; it may be fundamental, or just a limitation of our current code for loading immediates.

slp-perm-6.c: some wins from constants, and dropping 8 tbls.

slp-perm-7.c: Similarly.

slp-perm-8.c: The loop here iterates 4 times, and the ld3/st3 manages to lose us 4*move and 9*tbl per iteration (!); huge improvement.

slp-perm-11.c: a 16-iteration loop gets unrolled *2, and now uses an st2, but no load_lanes, just a bunch of ldr's: 10 rather than the original 3(*2). 3 strs become 4 stp's (+st2). Doesn't look like an improvement!

However, 7 out of 8 cases look better, in some cases much better. So I'd say that was a definite codegen improvement :).
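
For readers without the testsuite to hand, here is a minimal sketch of the kind of loop under discussion. It is a hypothetical example, not one of the slp-perm-*.c testcases; the name rgb_mix and its signature are made up for illustration. Each iteration reads a group of three adjacent elements and writes a permuted combination of them. A vectorizer can handle such a group either with whole-vector loads plus permutes (TBL/TRN on AArch64) or with load/store-lanes (ld3/st3). Which one GCC actually picks depends on the version, target and flags, so check the -S output rather than assuming.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration only: a stride-3 interleaved access with a
   cross-lane permutation, similar in spirit to the slp-perm tests.
   Each output element combines two different elements of the input
   group, so the vectorizer must de-interleave the input (e.g. ld3) or
   permute it (e.g. tbl) before it can do the arithmetic.  */
void rgb_mix(const uint8_t *restrict in, uint8_t *restrict out,
             size_t n_pixels)
{
    for (size_t i = 0; i < n_pixels; i++) {
        uint8_t r = in[3 * i];
        uint8_t g = in[3 * i + 1];
        uint8_t b = in[3 * i + 2];

        /* Results wrap modulo 256 because of the uint8_t casts.  */
        out[3 * i]     = (uint8_t)(r + g);
        out[3 * i + 1] = (uint8_t)(g + b);
        out[3 * i + 2] = (uint8_t)(b + r);
    }
}
```

Compiling something like this for AArch64 at -O3 with -S, before and after the change, is one way to make the same kind of instruction-count comparison as above.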