https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115102
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
I believe this might be the middle-end using bswap32 (you can try to confirm
for SH by looking at the dump generated by -fdump-tree-optimized).
For x86_64 we get
uint32_t bswap8 (uint32_t val)
{
  unsigned int _1;
  unsigned int bswapdst_4;
  uint32_t _8;
  unsigned int _10;
  unsigned int bswapmaskdst_11;

  <bb 2> [local count: 1073741824]:
  _1 = val_7(D) & 4294901760;
  bswapdst_4 = __builtin_bswap32 (val_7(D));
  bswapmaskdst_11 = bswapdst_4 & 4294901760;
  _10 = bswapmaskdst_11 r>> 16;
  _8 = _1 | _10;
  return _8;

}
and similar assembly:
bswap8:
.LFB0:
        .cfi_startproc
        movl    %edi, %eax
        xorw    %di, %di
        bswap   %eax
        shrl    $16, %eax
        orl     %edi, %eax
        ret
though on x86 there's no high-word-preserving swap of the lower 2 bytes.
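
For reference, a minimal C sketch of what the optimized dump above computes, next to a plain mask-and-shift form. The original testcase is not quoted in this comment, so swap_low_bytes_plain is only a guess at its shape; the function names and the rotr32 helper are made up for illustration. 4294901760 is 0xffff0000, and r>> is a rotate right.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed shape of the source function (not quoted in this comment):
   keep the high 16 bits, swap the two low bytes.  */
uint32_t swap_low_bytes_plain (uint32_t val)
{
  return (val & 0xffff0000u)
         | ((val & 0x0000ff00u) >> 8)
         | ((val & 0x000000ffu) << 8);
}

/* 32-bit rotate right, standing in for the GIMPLE r>> operator.  */
uint32_t rotr32 (uint32_t x, unsigned int n)
{
  n &= 31;
  return n ? (x >> n) | (x << (32 - n)) : x;
}

/* Statement-by-statement transcription of the optimized dump.  */
uint32_t swap_low_bytes_gimple (uint32_t val)
{
  uint32_t hi = val & 4294901760u;             /* _1 */
  uint32_t swapped = __builtin_bswap32 (val);  /* bswapdst_4 */
  uint32_t masked = swapped & 4294901760u;     /* bswapmaskdst_11 */
  uint32_t lo = rotr32 (masked, 16);           /* _10 */
  return hi | lo;                              /* _8 */
}

/* Checks that both forms agree on a few sample inputs.  */
void check_forms_agree (void)
{
  static const uint32_t samples[] = { 0u, 0x11223344u, 0xffff00ffu, 0xdeadbeefu };
  for (unsigned int i = 0; i < sizeof samples / sizeof samples[0]; i++)
    assert (swap_low_bytes_plain (samples[i]) == swap_low_bytes_gimple (samples[i]));
}
```

Calling check_forms_agree () from a small main should pass silently if the guess at the source shape is right. Compiling the plain form with -O2 -fdump-tree-optimized is one way to see whether the bswap32 sequence shows up on a given target.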