http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47769
           Summary: [missed optimization] use of btr (bit test and reset)
           Product: gcc
           Version: 4.5.0
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: target
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: kr...@kde.org

The code:

    tmp &= ~(1 << bit);

is translated to separate shift, not, and and instructions. Instead, GCC could
emit a single btr instruction (which modifies the flags: unwanted but
acceptable):

    btr %[bit], %[tmp]

The btr instruction has a latency of 1 cycle and a throughput of 0.5 cycles on
all recent Intel CPUs and thus outperforms the shift + not + and combination.

Rationale:
I use this pattern to iterate over a bitmask. I use bsf
(_bit_scan_forward(bitmask)) to find the lowest set bit. To find the next one I
have to mask off the bit just found, and I currently have to use inline
assembly to get a btr instruction there.

Alternatively, an intrinsic for btr and friends might make sense.
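For reference, a minimal sketch of the iteration pattern described above, assuming GCC or Clang. The function names (clear_bit_portable, clear_bit_btr, collect_set_bits) are made up for illustration and are not taken from the reporter's code. __builtin_ctz stands in for the bsf step, and the C form uses an unsigned constant instead of the plain 1 from the report so that bit 31 is well defined. The inline assembly is only one possible way to write the workaround; on non-x86 targets the sketch falls back to the C form.

```c
#include <assert.h>
#include <stdint.h>

/* Plain C form from the report: clear bit `bit` of `tmp`. */
static inline uint32_t clear_bit_portable(uint32_t tmp, unsigned bit)
{
    assert(bit < 32);                  /* shifting by >= 32 is undefined */
    return tmp & ~(UINT32_C(1) << bit);
}

/* Hypothetical inline-assembly workaround that forces a btr instruction. */
static inline uint32_t clear_bit_btr(uint32_t tmp, unsigned bit)
{
    assert(bit < 32);
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* "+r" keeps tmp in a register (the memory form of btr is slow);
       "Ir" allows an immediate in 0..31 or a register; btr clobbers flags. */
    __asm__("btrl %1, %0" : "+r"(tmp) : "Ir"(bit) : "cc");
    return tmp;
#else
    return clear_bit_portable(tmp, bit);   /* fallback on other targets */
#endif
}

/* Walk all set bits of `mask`, lowest first, storing their indices in `out`.
   Returns the number of set bits found. */
static unsigned collect_set_bits(uint32_t mask, unsigned out[32])
{
    unsigned n = 0;
    while (mask != 0) {
        unsigned bit = (unsigned)__builtin_ctz(mask); /* lowest set bit (bsf step) */
        out[n++] = bit;
        mask = clear_bit_btr(mask, bit);              /* mask off the bit just found */
    }
    return n;
}
```

Calling collect_set_bits on a mask with bits 0, 2, 8 and 31 set should fill out with those four indices in ascending order. Replacing clear_bit_btr with clear_bit_portable inside the loop gives the variant that is currently compiled to shift + not + and.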