https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94837
            Bug ID: 94837
           Summary: Failure to optimize out spurious movbe into bswap
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gabravier at gmail dot com
  Target Milestone: ---

float swapFloat(float x)
{
    union
    {
        float f;
        uint32_t u32;
    } swapper;
    swapper.f = x;
    swapper.u32 = __builtin_bswap32(swapper.u32);
    return swapper.f;
}

For this function, on x86-64 with `-O3 -mmovbe`, LLVM outputs this:

swapFloat(float): # @swapFloat(float)
  movd eax, xmm0
  bswap eax
  movd xmm0, eax
  ret

GCC instead outputs this:

swapFloat(float):
  movd DWORD PTR [rsp-4], xmm0
  movbe eax, DWORD PTR [rsp-4]
  movd xmm0, eax
  ret

It seems highly likely to me that a spill to memory is much slower than a direct `bswap`.
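For anyone trying to reproduce this, here is a self-contained sketch of the test case with the `<stdint.h>` include it needs. The second function, `swapFloatMemcpy`, is a hypothetical variant that is not part of the original report; it performs the same byte swap through `memcpy` instead of a union and is included only as a possible comparison point. No claim is made here about what code either compiler generates for it.

```c
#include <stdint.h>
#include <string.h>

/* Test case from the report: type punning through a union. */
float swapFloat(float x)
{
    union
    {
        float f;
        uint32_t u32;
    } swapper;
    swapper.f = x;
    swapper.u32 = __builtin_bswap32(swapper.u32);
    return swapper.f;
}

/* Hypothetical variant (not in the report): same byte swap via memcpy. */
float swapFloatMemcpy(float x)
{
    uint32_t u32;
    memcpy(&u32, &x, sizeof u32);   /* copy the float's bits into an integer */
    u32 = __builtin_bswap32(u32);   /* reverse the four bytes */
    memcpy(&x, &u32, sizeof x);     /* copy the swapped bits back */
    return x;
}
```

Assuming the file is saved as `swap.c` (an arbitrary name), something like `gcc -O3 -mmovbe -masm=intel -S swap.c` should produce Intel-syntax assembly comparable to the listings above.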