--- .c ---
int ispowerof2(unsigned long long argument) {
    return __builtin_popcountll(argument) == 1;
}
--- EOF ---
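
For reference, a minimal sketch that cross-checks the function above
against the classic branch-free test "nonzero, and clearing the lowest
set bit yields zero". The names ispowerof2_ref and check_samples and the
sample values are assumptions for illustration only; they are not part
of the report. __builtin_popcountll needs GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>

/* The function under discussion, as given in the report. */
int ispowerof2(unsigned long long argument) {
    return __builtin_popcountll(argument) == 1;
}

/* Hypothetical reference: exactly one bit is set if and only if the
   value is nonzero and x & (x - 1), which clears the lowest set bit,
   is zero. */
int ispowerof2_ref(unsigned long long x) {
    return x != 0 && (x & (x - 1)) == 0;
}

/* Compare both versions on a few edge cases; returns the number of
   samples checked. */
size_t check_samples(void) {
    const unsigned long long samples[] = {
        0ULL, 1ULL, 2ULL, 3ULL,
        1ULL << 32, (1ULL << 32) + 1ULL,
        1ULL << 63, ~0ULL
    };
    const size_t count = sizeof samples / sizeof samples[0];
    for (size_t i = 0; i < count; i++)
        assert(ispowerof2(samples[i]) == ispowerof2_ref(samples[i]));
    return count;
}
```

Calling check_samples() from a small main() is enough to run the
comparison; it covers both 32-bit halves, which matters for the -m32
listings below.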

GCC 13.3    gcc -m32 -march=alderlake -O3
            gcc -m32 -march=sapphirerapids -O3
            gcc -m32 -mpopcnt -mtune=sapphirerapids -O3

https://gcc.godbolt.org/z/cToYrrYPq
ispowerof2(unsigned long long):
        xor     eax, eax        # superfluous
        xor     edx, edx        # superfluous
        popcnt  eax, [esp+4]
        popcnt  edx, [esp+8]
        add     eax, edx
        cmp     eax, 1      ->    dec  eax
        sete    al
        movzx   eax, al         # superfluous
        ret

9 instructions in 28 bytes      # 6 instructions in 20 bytes

OUCH: popcnt writes the WHOLE result register, so there is ABSOLUTELY
      no need to clear it beforehand or to clear the upper 24 bits
      afterwards!
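
A minimal sketch of the register contents, assuming the non-destructive
cmp form of the sequence above: the two 32-bit counts sum to at most 64,
so bits 8..31 of eax are already zero when sete writes al. The function
name model_cmp_sete is hypothetical, and the model covers only the
cmp variant, not a compare that modifies eax.

```c
#include <assert.h>
#include <stdint.h>

/* Models, on a 32-bit register and without a trailing movzx:
       popcnt eax, [esp+4]
       popcnt edx, [esp+8]
       add    eax, edx
       cmp    eax, 1
       sete   al
   Returns the full content of eax at the ret. */
uint32_t model_cmp_sete(uint64_t argument) {
    /* popcnt overwrites all 32 bits of its destination register. */
    uint32_t eax = (uint32_t)__builtin_popcount((uint32_t)argument);
    uint32_t edx = (uint32_t)__builtin_popcount((uint32_t)(argument >> 32));

    eax += edx;                      /* 0..64, so bits 8..31 are zero */
    assert((eax >> 8) == 0);

    uint32_t zf = (eax == 1);        /* cmp sets flags, eax is unchanged */
    eax = (eax & 0xFFFFFF00u) | zf;  /* sete al writes the low 8 bits only */
    return eax;
}
```

Under these assumptions the returned value is exactly 0 or 1 for every
input, for example model_cmp_sete(0) is 0 and model_cmp_sete(1ULL << 40)
is 1.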

JFTR: before GCC zealots write nonsense about a false dependency on
      the destination register: see the -march= or -mtune= options above.

GCC 13.3    gcc -mpopcnt -mtune=barcelona -O3

https://gcc.godbolt.org/z/3Ks8vh7a6
ispowerof2(unsigned long long):
        popcnt  rdi, rdi    ->    popcnt  rax, rdi
        xor     eax, eax        # superfluous!
        dec     edi         ->    dec     eax
        sete    al          ->    setz    al
        ret

GCC 13.3    gcc -m32 -mpopcnt -mtune=barcelona -O3

https://gcc.godbolt.org/z/s5s5KTGnv
ispowerof2(unsigned long long):
        popcnt  eax, [esp+4]
        popcnt  edx, [esp+8]
        add     eax, edx
        dec     eax
        sete    al
        movzx   eax, al        # superfluous!
        ret

Will GCC eventually generate properly optimised code instead of bloat?

Stefan
