http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55147
--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-10-31 16:07:11 UTC ---
For the testcase from this PR it actually creates better assembly (compared to
with the #c1 patch; without that it is both longer and wrong). That is because
when bswapdi is split too late, nothing optimizes the fact that only 32 bits of
the result are used.

For

unsigned long long
f1 (unsigned long long *p, int i)
{
  return __builtin_bswap64 (p[i]);
}

unsigned long long
f2 (unsigned long long p)
{
  return __builtin_bswap64 (p);
}

void
f3 (unsigned long long *p, int i, unsigned long long q)
{
  p[i] = __builtin_bswap64 (q);
}

void
f4 (unsigned long long *p, int i, unsigned long long *q)
{
  p[i] = __builtin_bswap64 (q[i]);
}

it creates the same number of insns and the same code quality (just slightly
different RA decisions/scheduling) for f1-f3, but for f4 it creates slightly
worse code without bswapdi2 (with bswapdi2 f4 needs just one call-saved
register, without it two, supposedly because both bswap insns are scheduled
together).
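To illustrate the "only 32 bits of the result are used" point, here is a minimal sketch. It is not the testcase from this PR, and the function names are made up for illustration. When only the low half of a 64-bit byte swap is kept, the result equals a 32-bit byte swap of the high half of the input, so a single 32-bit swap is enough if the compiler sees the truncation early enough.

```c
#include <stdint.h>

/* Hypothetical helper: byte-swap 64 bits, then keep only the low 32 bits. */
uint32_t
low_half_of_bswap64 (uint64_t x)
{
  return (uint32_t) __builtin_bswap64 (x);
}

/* Equivalent computation: the low half of the swapped value is the
   byte-swapped high half of the input, so one 32-bit swap suffices.  */
uint32_t
bswap32_of_high_half (uint64_t x)
{
  return __builtin_bswap32 ((uint32_t) (x >> 32));
}
```

Both functions return the same value for every input; whether the first is compiled down to the second depends on the target and on when the 64-bit swap is split.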