https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837
--- Comment #3 from Peter Cordes <peter at cordes dot ca> ---
Seems to be fixed in gcc7.2.0: https://godbolt.org/g/jRwtZN

gcc7.2 is fine with -m32, -mx32, and -m64, but x32 is the most compact.
-m64 just calls __atomic_load_16.

gcc7.2 -O3 -mx32 output:

follow_nounion(std::atomic<counted_ptr>*):
        movq    (%edi), %rax
        movl    %eax, %eax
        ret

vs. gcc7.1 -O3 -mx32:

follow_nounion(std::atomic<counted_ptr>*):
        movq    (%edi), %rcx
        xorl    %edx, %edx
        movzbl  %ch, %eax
        movb    %cl, %dl
        movq    %rcx, %rsi
        movb    %al, %dh
        andl    $16711680, %esi
        andl    $4278190080, %ecx
        movzwl  %dx, %eax
        orq     %rsi, %rax
        orq     %rcx, %rax
        ret

-------

gcc7.2 -O3 -m64 just forwards its arg to __atomic_load_16 and then returns:

follow_nounion(std::atomic<counted_ptr>*):
        subq    $8, %rsp
        movl    $2, %esi
        call    __atomic_load_16
        addq    $8, %rsp
        ret

It unfortunately doesn't optimize the tail call to

        movl    $2, %esi
        jmp     __atomic_load_16

presumably because it hasn't realized early enough that it takes zero
instructions to extract the 8-byte low half of the 16-byte
__atomic_load_16 return value.
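For context, a hypothetical reconstruction of the kind of test case behind this comment (the comment only shows the asm, so the struct layout and field names here are assumptions): a pointer/counter pair loaded atomically with acquire ordering, returning just the pointer half. The sketch uses 32-bit members so the struct is 8 bytes on any target and stays lock-free for easy testing; in the actual report, under -m64 the struct is 16 bytes, which is what routes the load through the __atomic_load_16 libatomic call with memory_order_acquire (the constant 2 loaded into %esi above).

```cpp
#include <atomic>
#include <cstdint>

// Assumed layout: a 32-bit "pointer" plus a 32-bit generation counter,
// packed into 8 bytes (as it would be under -mx32, where pointers are
// 32-bit). Field names are illustrative, not from the bug report.
struct counted_ptr {
    uint32_t ptr;
    uint32_t count;
};

// Atomically load the whole pair, then return only the pointer half.
// The fixed gcc7.2 -mx32 code is just a movq load plus a 32-bit
// zero-extend of the low half; gcc7.1 instead reassembled the value
// byte by byte.
uint32_t follow_nounion(std::atomic<counted_ptr>* p) {
    // memory_order_acquire is the enumerator value 2 that gcc passes
    // as the second argument to __atomic_load_16 in the -m64 output.
    return p->load(std::memory_order_acquire).ptr;
}
```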