https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49244
--- Comment #13 from dhowells at redhat dot com <dhowells at redhat dot com> ---
Very nice :-)  There are a couple of under-optimisations left, though.

Firstly:

#define BITS_PER_LONG (sizeof(long) * 8)
#define _BITOPS_LONG_SHIFT 6

static __always_inline bool test_and_change_bit(long bit,
                                                volatile unsigned long *ptr)
{
        unsigned long mask = 1UL << (bit & (BITS_PER_LONG - 1));
        unsigned long old;

        ptr += bit >> _BITOPS_LONG_SHIFT;
        old = __atomic_fetch_xor(ptr, mask, __ATOMIC_SEQ_CST);
        return old & mask;
}

bool change_bit_3(unsigned long *p, long n)
{
        return test_and_change_bit(n, p);
}

is compiled to:

0000000000000048 <change_bit_3>:
  48:   48 89 f0                mov    %rsi,%rax
  4b:   83 e6 3f                and    $0x3f,%esi
  4e:   48 c1 f8 06             sar    $0x6,%rax
  52:   f0 48 0f bb 34 c7       lock btc %rsi,(%rdi,%rax,8)
  58:   0f 92 c0                setb   %al
  5b:   c3                      retq

on x86.  The three instructions at 48-4e are redundant, as the BTC instruction
with a memory operand will do that decomposition of the bit offset for you.
I don't know whether it's more efficient this way or not, though.

Secondly:

static __always_inline bool test_bit(long bit, const unsigned long *ptr)
{
        unsigned long mask = 1UL << (bit & (BITS_PER_LONG - 1));
        unsigned long old;

        ptr += bit >> _BITOPS_LONG_SHIFT;
        old = __atomic_load_n(ptr, __ATOMIC_RELAXED);
        return old & mask;
}

bool read_bit(unsigned long *p)
{
        return test_bit(3, p);
}

is compiled to:

0000000000000000 <read_bit>:
   0:   48 8b 07                mov    (%rdi),%rax
   3:   48 c1 e8 03             shr    $0x3,%rax
   7:   83 e0 01                and    $0x1,%eax
   a:   c3                      retq

but could actually be either a TEST instruction or a BT instruction.

Still, thanks very much for looking at this!
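
For reference, here is a minimal hand-written sketch of the shorter sequence the
first example is hoping for (this is illustrative only and not part of the test
case above; the name test_and_change_bit_btc and the inline asm are mine).  It
relies on BTC's memory-operand form, where the register bit offset selects both
the word and the bit within it, so the explicit shift/mask is unnecessary; the
"memory" clobber is the conservative way of telling GCC that a word other than
*ptr may be touched:

#include <stdbool.h>

static inline bool test_and_change_bit_btc(long bit, volatile unsigned long *ptr)
{
        bool old;

        /* BTC with a memory operand picks out both the word and the bit
         * within it from the register bit offset itself; CF gets the old
         * bit value, which SETC copies out. */
        asm volatile("lock btcq %2, %1\n\t"
                     "setc %0"
                     : "=q" (old), "+m" (*ptr)
                     : "r" (bit)
                     : "cc", "memory");
        return old;
}

And for the read_bit() case, since the bit number is a compile-time constant,
one might hope for something along these lines (hypothetical, not actual
compiler output):

        testb  $0x8,(%rdi)      # mask for bit 3
        setne  %al
        retq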