https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65456
Bug ID: 65456
Summary: powerpc64le autovectorized copy loop missed optimization
Product: gcc
Version: 5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: anton at samba dot org

Created attachment 35049
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35049&action=edit
Testcase pulled from valgrind

The attached copy loop (out of valgrind) produces some pretty bad code:

 df8:   e4 06 9e 78     rldicr  r30,r4,0,59
 dfc:   e4 26 df 78     rldicr  r31,r6,4,59
 e00:   10 00 84 38     addi    r4,r4,16
 e04:   01 00 c6 38     addi    r6,r6,1
 e08:   99 f6 20 7c     lxvd2x  vs33,0,r30
 e0c:   57 0a 21 f0     xxswapd vs33,vs33
 e10:   2b 03 a1 11     vperm   v13,v1,v0,v12
 e14:   97 0c 01 f0     xxlor   vs32,vs33,vs33
 e18:   56 6a 0d f0     xxswapd vs0,vs45
 e1c:   98 4f 1f 7c     stxvd2x vs0,r31,r9
 e20:   d8 ff 00 42     bdnz    df8 <memmove+0x6e8>

Since we are using VSX storage ops, we should just align the source and do unaligned stores. That would remove the permute, and then GCC's pass that removes redundant swaps should kick in and eliminate them too.
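For readers without the attachment: a minimal sketch of the kind of copy loop under discussion is below. The function name, element type, and compile options are assumptions for illustration, not the actual valgrind code from attachment 35049. Built with something like gcc -O3 -mcpu=power8 for powerpc64le, a loop of this shape is the sort of input the autovectorizer turns into a lxvd2x / vperm / stxvd2x sequence like the one above when the source alignment is not known at compile time.

  /* Illustrative copy loop (hypothetical stand-in for the attached testcase).
     When src is not known to be 16-byte aligned, the vectorizer emits a
     realignment permute on the load side rather than aligning the source
     and using unaligned stores.  */
  #include <stddef.h>

  void copy_words(unsigned long *dst, const unsigned long *src, size_t n)
  {
      for (size_t i = 0; i < n; i++)
          dst[i] = src[i];
  }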