http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47769

           Summary: [missed optimization] use of btr (bit test and reset)
           Product: gcc
           Version: 4.5.0
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: target
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: kr...@kde.org


The code:

tmp &= ~(1 << bit);

is translated into separate shift, not, and and instructions. Instead, GCC
could emit a single btr instruction (it modifies the flags, which is unwanted
but acceptable):

btr %[bit], %[tmp]

The btr instruction has a latency of 1 cycle and a reciprocal throughput of 0.5
cycles on all recent Intel CPUs and thus outperforms the shift + not + and
combination.
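
For comparison, here is a minimal sketch of the two forms side by side. The
helper names clear_bit_c and clear_bit_btr are hypothetical and not part of
GCC. The sketch assumes a 32-bit operand and GCC-style extended inline asm in
AT&T syntax. It uses an unsigned constant for the shift, unlike the one-liner
above, so that bit 31 is well defined. On non-x86 targets it falls back to the
portable expression.

```c
#include <assert.h>
#include <stdint.h>

/* Portable form from the report: shift, not, and. */
static inline uint32_t clear_bit_c(uint32_t tmp, unsigned bit)
{
    assert(bit < 32);
    return tmp & ~(UINT32_C(1) << bit);
}

/* Hypothetical helper: forces a btr instruction through inline asm. */
static inline uint32_t clear_bit_btr(uint32_t tmp, unsigned bit)
{
    assert(bit < 32);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* btr clears the selected bit in tmp and copies its old value
       into the carry flag, hence the "cc" clobber. */
    __asm__("btrl %[bit], %[tmp]"
            : [tmp] "+r"(tmp)
            : [bit] "Ir"((uint32_t)bit)
            : "cc");
    return tmp;
#else
    /* Assumption: no btr available, use the portable expression. */
    return clear_bit_c(tmp, bit);
#endif
}
```

Both helpers should return the same value for any bit in the range 0 to 31;
only the generated instructions differ.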

Rationale:
I use this pattern to iterate over a bitmask. I use bsf
(_bit_scan_forward(bitmask)) to find the lowest set bit. To find the next one,
I have to mask off the bit just found, and currently I have to use inline
assembly to get a btr instruction there. Alternatively, an intrinsic for btr
and friends might make sense.
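
The iteration described above can be sketched roughly as follows. This is an
illustration, not the reporter's actual code: collect_set_bits is a
hypothetical helper, and it uses the GCC builtin __builtin_ctz in place of
_bit_scan_forward to find the lowest set bit. The mask-off step is the
expression that the report would like to see compiled to btr.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: stores the indices of all set bits of mask in
   out[] in ascending order and returns how many were found. */
static unsigned collect_set_bits(uint32_t mask, unsigned out[32])
{
    unsigned n = 0;
    while (mask != 0) {
        /* Index of the lowest set bit; mask is nonzero here, so the
           builtin's result is defined. */
        unsigned bit = (unsigned)__builtin_ctz(mask);
        assert(n < 32); /* at most 32 bits can be set */
        out[n++] = bit;
        /* Mask off the bit just found: the candidate for btr. */
        mask &= ~(UINT32_C(1) << bit);
    }
    return n;
}
```

For a mask of 0x29 the helper visits bits 0, 3 and 5, in that order.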
