https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96633

--- Comment #2 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Martin added me to CC so I assume he wants me to chime in.

First of all, I find Nathan's behavior in that gcc@ thread distasteful at best
(but if you ask me, such responses are simply more harm than good; link:
https://lwn.net/ml/gcc/1a363f89-6f98-f583-e22a-a7fc02efb...@acm.org/ ).

Next, statements like "I've determined the following is abot 12% faster" don't
carry weight without details such as the CPU family, structure of the benchmark
and the workload. Obviously, on input that lacks whitespace GCC's original code
is faster as the initial branch is 100% predictable. Likewise, if the input was
taken from /dev/random, the 12% figure is irrelevant to real-world uses of such
code. What the benchmark is doing with the return value of the function also
matters a lot.

With that out of the way: striving to get efficient branchless code on this
code is not very valuable in practice, because the caller is likely to perform
a conditional branch on the result anyway. So making isWhitespace branchless
simply moves the misprediction cost to the caller, making the overall code
slower.

(but of course such considerations are too complex for the compiler's limited
brain)

In general such "bitmask tests" will benefit from the BT instruction on x86
(not an extension, was in the ISA since before I was born), plus CMOV to get
the right mask if it doesn't fit in a register.

For 100% branchless code we want to generate code similar to:

char is_ws(char c)
{
        unsigned long long mask = 1ll<<' ' | 1<<'\t' | 1<<'\r' | 1<<'\n';
        unsigned long long v = c;
        if (v > 32)
#if 1
            mask = 0;
#else
            return 0;
#endif
        char r;
        asm("bt %1, %2; setc %0" : "=r"(r) : "r"(v), "r"(mask));
        return r;
}

        movsbq  %dil, %rax
        movl    $0, %edx
        movabsq $4294977024, %rdi
        cmpq    $33, %rax
        cmovnb  %rdx, %rdi
        bt %rax, %rdi; setc %al
        ret

(note we get %edx zeroing suboptimal, should have used xor %edx, %edx)

This is generalizable to any input type, not just char.

We even already get the "test against a mask" part of the idea right ;)

Branchy testing is even cheaper with BT:

void is_ws_cb(unsigned char c, void f(void))
{
        unsigned long long mask = 1ll<<' ' | 1<<'\t' | 1<<'\r' | 1<<'\n';
        if (c <= 32 && (mask & (1ll<<c)))
            f();
}

        cmpb    $32, %dil
        ja      .L5
        movabsq $4294977024, %rax
        btq     %rdi, %rax
        jnc     .L5
        jmp     *%rsi
.L5:
        ret

Reply via email to