https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115102
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
I believe this might be the middle-end using bswap32 (you can try to confirm
for SH by looking at the dump generated by -fdump-tree-optimized). For x86_64
we get

uint32_t bswap8 (uint32_t val)
{
  unsigned int _1;
  unsigned int bswapdst_4;
  uint32_t _8;
  unsigned int _10;
  unsigned int bswapmaskdst_11;

  <bb 2> [local count: 1073741824]:
  _1 = val_7(D) & 4294901760;
  bswapdst_4 = __builtin_bswap32 (val_7(D));
  bswapmaskdst_11 = bswapdst_4 & 4294901760;
  _10 = bswapmaskdst_11 r>> 16;
  _8 = _1 | _10;
  return _8;

and a similar

bswap8:
.LFB0:
        .cfi_startproc
        movl    %edi, %eax
        xorw    %di, %di
        bswap   %eax
        shrl    $16, %eax
        orl     %edi, %eax
        ret

though on x86 there's no high word preserving swap of the lower 2 bytes.
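
For readers following the dump, here is a rough C sketch of what the optimized GIMPLE appears to compute. The original test case is not quoted in this comment, so swap_low_bytes is only an assumed source-level form (swap the two low bytes, keep the upper 16 bits), and both function names are made up for illustration. The second function follows the dump statement by statement, with 4294901760 written as 0xffff0000 and the r>> 16 spelled out as a rotate.

```c
#include <stdint.h>

/* Assumed source-level form (not taken from the PR): swap the two low
   bytes of val and leave the upper 16 bits untouched.  */
static uint32_t swap_low_bytes(uint32_t val)
{
    return (val & 0xffff0000u)
         | ((val & 0x0000ff00u) >> 8)
         | ((val & 0x000000ffu) << 8);
}

/* Mirrors the optimized dump: full bswap32, mask, rotate right by 16,
   then OR with the preserved high half.  */
static uint32_t swap_low_bytes_via_bswap32(uint32_t val)
{
    uint32_t hi      = val & 0xffff0000u;              /* _1 */
    uint32_t swapped = __builtin_bswap32(val);         /* bswapdst_4 */
    uint32_t masked  = swapped & 0xffff0000u;          /* bswapmaskdst_11 */
    uint32_t lo      = (masked >> 16) | (masked << 16); /* _10, r>> 16 */
    return hi | lo;                                    /* _8 */
}
```

Because the low 16 bits of the masked value are zero, the rotate behaves like a plain right shift, which matches the shrl $16 in the x86_64 assembly.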