https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46091
--- Comment #10 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Avi Kivity from comment #9)
> I believe the comment is wrong. Here's what the manual says:
>
> "This instruction can be used with a LOCK prefix to allow the instruction to
> be executed atomically."
>
> Implying that without the LOCK prefix, it is not atomic. XCHG is the only
> instruction that asserts LOCK implicitly.
>
> Agner lists BTC reciprocal throughput as 1 for imm, mem case and 5 for reg,
> mem. The latter is slow, but perhaps still worthwhile as a replacement for
> the code in the first comment (but probably not when addressing a single
> word).

BTC/BTR/BTS with a memory operand (RMW) is indeed slower, but so are other
logic instructions.

The following testcase:

--cut here--
extern unsigned long long a;

void test (void)
{
  a &= ~(1ull << 55);
}
--cut here--

should generate an RMW BTR instruction. I'll look into this some more.

However, these insns should be rare, so do not expect any noticeable
application speed-up ...

> Note there is also the BT instruction (with reciprocal throughput of 0.5!)

Yes, we already emit this.
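For reference, here is a minimal sketch of the C idioms that correspond to
the four bit instructions discussed in this thread. The helper names
(clear_bit, set_bit, flip_bit, test_bit) are hypothetical and exist only for
illustration. Whether the compiler actually selects BTR/BTS/BTC/BT for them
is an assumption that depends on GCC version, target and tuning options.
None of these operations is atomic; as noted in comment #9, atomicity would
require a LOCK prefix.

```c
#include <stdint.h>

/* Clear bit n: the same shape as the testcase above (BTR candidate). */
static inline uint64_t clear_bit(uint64_t x, unsigned n)
{
    return x & ~(1ull << n);
}

/* Set bit n (BTS candidate). */
static inline uint64_t set_bit(uint64_t x, unsigned n)
{
    return x | (1ull << n);
}

/* Complement bit n (BTC candidate). */
static inline uint64_t flip_bit(uint64_t x, unsigned n)
{
    return x ^ (1ull << n);
}

/* Read bit n without modifying x (BT candidate). */
static inline int test_bit(uint64_t x, unsigned n)
{
    return (int)((x >> n) & 1u);
}
```

The helpers assume n is in the range 0..63; a shift by 64 or more is
undefined behaviour in C. Bit 55 is a useful test value on x86-64 because
the mask does not fit in a sign-extended 32-bit immediate, so a plain AND
with an immediate operand cannot encode it directly.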