https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069
--- Comment #1 from Thiago Macieira <thiago at kde dot org> ---
(the assembly doesn't match the source code, but we got your point)

Another possible improvement for the __atomic_fetch_{and,nand,or} functions is that they can check whether the fetched value is already correct and branch out. In your example, the __atomic_fetch_or with 0x40000000 can check whether that bit is already set and, if so, not execute the CMPXCHG at all. This is a valid solution for x86 for memory orderings up to acq_rel; other architectures may still need barriers.

For seq_cst, we either need a barrier or we need to execute the CMPXCHG at least once. Therefore, the emitted code might want to optimistically execute the operation once and, if it fails, enter the load loop. That's slightly longer codegen; whether we want that under -Os or not, you'll have to be the judge.

Prior art, from glibc/sysdeps/x86_64/nptl/pthread_spin_lock.S:

ENTRY(__pthread_spin_lock)
1:      LOCK
        decl    0(%rdi)
        jne     2f
        xor     %eax, %eax
        ret

        .align  16
2:      rep
        nop
        cmpl    $0, 0(%rdi)
        jg      1b
        jmp     2b
END(__pthread_spin_lock)

This does the atomic operation once, hoping it'll succeed. If it fails, it enters the PAUSE+CMP+JG loop until the value is suitable.
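To make the suggestion concrete, here is a rough C sketch of the two codegen strategies described above, written with GCC's __atomic builtins; the function names are illustrative only and not part of the report. The first variant is the early exit that is valid up to acq_rel; the second executes the CMPXCHG optimistically once, as required for seq_cst, before allowing the early exit.

/* Early-exit variant, valid for orderings up to acq_rel on x86: if the bit
   is already set in the loaded value, skip the LOCK CMPXCHG entirely. */
static unsigned fetch_or_early_exit(unsigned *p, unsigned bit)
{
    unsigned old = __atomic_load_n(p, __ATOMIC_ACQUIRE);
    while (!(old & bit)) {
        /* On failure, 'old' is reloaded with the current value, so the
           loop re-checks the bit before trying CMPXCHG again. */
        if (__atomic_compare_exchange_n(p, &old, old | bit, 1,
                                        __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
            break;
    }
    return old;
}

/* seq_cst variant: the RMW (or a fence) must run at least once, so try the
   CMPXCHG optimistically first; only after it has executed is the
   "bit already set" early exit taken. */
static unsigned fetch_or_seq_cst(unsigned *p, unsigned bit)
{
    unsigned old = __atomic_load_n(p, __ATOMIC_SEQ_CST);
    do {
        if (__atomic_compare_exchange_n(p, &old, old | bit, 1,
                                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
            break;
        /* The failed CMPXCHG already executed the locked RMW, so bailing
           out once the bit is observed set is acceptable here. */
    } while (!(old & bit));
    return old;
}

The intent is that the test on 'old' would compile down to a TEST plus a branch around the LOCK CMPXCHG loop, mirroring what the spin-lock assembly above does by hand.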