https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca

--- Comment #4 from Peter Cordes <peter at cordes dot ca> ---
Yes, rep scasb is abysmal, and gcc -O3's 4-byte-at-a-time scalar loop is not
very good either.

With 16-byte alignment (which we have from calloc on x86-64 System V), we can
inline a *much* better SSE2 loop.  See
https://stackoverflow.com/a/55589634/224132 for more details and
microbenchmarks.

On Skylake it's about 4 to 5x faster than the current 4-byte loop for large
strings, 3x faster for short strings.  For short strings (strlen=33), it's
about 1.5x faster than calling strlen.  For very large strings (too big for L2
cache), it's ~1.7x slower than glibc's AVX2 strlen.

The lack of VEX encoding for pxor and pmovmskb is just me being lazy; let gcc
emit them all with VEX if AVX is enabled.

   # at this point gcc has `s` in RDX, `i` in ECX

    pxor       %xmm0, %xmm0         # zeroed vector to compare against
    .p2align 4
.Lstrlen16:                         # do {
#ifdef __AVX__
    vpcmpeqb   (%rdx), %xmm0, %xmm1
#else
    movdqa     (%rdx), %xmm1
    pcmpeqb    %xmm0, %xmm1           # xmm1 = -1 where there was a 0 in memory
#endif

    add         $16, %rdx             # ptr++
    pmovmskb  %xmm1, %eax             # extract high bit of each byte to a 16-bit mask
    test       %eax, %eax
    jz        .Lstrlen16            # }while(mask==0);
    # RDX points at the 16-byte chunk *after* the one containing the terminator
    # EAX = bit-mask of the 0 bytes, and is known to be non-zero
    bsf        %eax, %eax           # EAX = bit-index of the lowest set bit

    # terminator is at rdx+rax - 16
    #  movb       $'A', -16(%rdx, %rax)  // for a microbench that used s[strlen(s)]='A'
    sub        %rbp, %rdx            # p -= start  (%rbp holds the original start pointer)
    lea        -16(%rdx, %rax), %rax # p += byte_within_vector - 16; length in RAX

We should actually use  REP BSF  (which runs as TZCNT on CPUs that support it),
because that's faster on AMD and the same speed on Intel.
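
As a cross-check, here is a rough C-with-intrinsics sketch of the same loop (my
own restatement, not a proposed gcc expansion), assuming `s` is 16-byte
aligned; __builtin_ctz leaves the compiler free to emit BSF or TZCNT:

    #include <emmintrin.h>   /* SSE2 */
    #include <stddef.h>

    /* Hypothetical helper: strlen for a pointer known to be 16-byte aligned
     * (e.g. fresh from calloc on x86-64 SysV).  Like the asm, it can read up
     * to 15 bytes past the terminator, but the aligned loads never cross
     * into a new page. */
    static size_t strlen_sse2_aligned(const char *s)
    {
        const __m128i zero = _mm_setzero_si128();               /* pxor */
        const char *p = s;
        unsigned mask;
        do {
            __m128i v = _mm_load_si128((const __m128i *)p);     /* movdqa */
            mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));  /* pcmpeqb + pmovmskb */
            p += 16;                                            /* ptr++ */
        } while (mask == 0);                                    /* test / jz */
        /* mask is non-zero here, so ctz is well-defined. */
        return (size_t)(p - s) - 16 + __builtin_ctz(mask);
    }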


There's also an inline-asm implementation of it, with a microbenchmark adapted
from the SO question (compile with -DUSE_ASM -DREAD_ONLY to benchmark a fixed
length repeatedly):
https://godbolt.org/z/9tuVE5

It uses clock() for timing, which I didn't bother updating.  I made it possible
to run lots of iterations so the timing is consistent and the real-work portion
dominates the runtime, letting us measure it with perf stat.

----


If we only have 4-byte alignment, maybe check the first 4B, then do (p+4) & ~7
to either overlap that 4B again or not when we start 8B chunks.  But probably
it's good to get to 16-byte alignment and do whole SSE2 vectors, because
repeating an aligned 16-byte test that overlaps an 8-byte test costs the same
as doing another 8-byte test.  (Except on CPUs like Bobcat that split 128-bit
vectors into 64-bit halves).  The extra AND to round down to an alignment
boundary is all it takes, plus the code-size cost of peeling 1 iteration each
of 4B and 8B before a 16-byte loop.

We can do the 4B / 8B checks with movd / movq loads instead of movdqa.  Those
zero-extend into the full XMM register, so the upper bytes compare equal to
zero and we have to ignore their pmovmskb bits: `test %al,%al` covers the low
8 bytes for the 8-byte case, and `test $0x0F, %al` checks only the low 4 bits
of EAX for the 4-byte case.  (Sketch below.)
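
One possible shape for that, as a C-intrinsics sketch (structure and names are
mine, not something from the bug): peel a 4-byte and an 8-byte check, then
round down to 16 and run whole aligned vectors.  The rounding never goes below
`s`, and the peeled checks cover every byte below the first full vector:

    #include <emmintrin.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical sketch for a pointer with only 4-byte alignment.  As with
     * the aligned version, loads may extend past the terminator but stay
     * inside an aligned chunk, so they never touch a new page. */
    static size_t strlen_4align_sse2(const char *s)
    {
        const __m128i zero = _mm_setzero_si128();

        /* 4-byte check (movd): the upper 12 bytes of the vector are zero, so
         * only the low 4 mask bits are meaningful (test $0x0F, %al). */
        int32_t w4;
        memcpy(&w4, s, 4);
        unsigned mask = _mm_movemask_epi8(_mm_cmpeq_epi8(_mm_cvtsi32_si128(w4), zero));
        if (mask & 0x0F)
            return __builtin_ctz(mask);

        /* 8-byte check (movq) at (s+4) & ~7: either overlaps the 4 bytes
         * above or starts right after them; only the low 8 mask bits matter
         * (test %al,%al). */
        const char *p = (const char *)(((uintptr_t)s + 4) & ~(uintptr_t)7);
        int64_t w8;
        memcpy(&w8, p, 8);
        mask = _mm_movemask_epi8(_mm_cmpeq_epi8(_mm_cvtsi64_si128(w8), zero));
        if (mask & 0xFF)
            return (size_t)(p - s) + __builtin_ctz(mask);

        /* Whole 16-byte vectors from the 16-byte boundary at or below p+8;
         * the first one may re-test bytes the 8-byte check already covered. */
        p = (const char *)(((uintptr_t)p + 8) & ~(uintptr_t)15);
        do {
            __m128i v = _mm_load_si128((const __m128i *)p);
            mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
            p += 16;
        } while (mask == 0);
        return (size_t)(p - s) - 16 + __builtin_ctz(mask);
    }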

----

The scalar bithack version can use BSF instead of CMOV binary search for the
byte with a set high bit.  That should be a win if we ever wanted to do scalar
on some x86 target especially with 8-byte registers, or on AArch64.  AArch64
can rbit / clz to emulate bsf and find the position of the first set bit.

(Without efficient SIMD compare result -> integer_mask, or efficient SIMD ->
integer at all on some ARM / AArch64 chips, SIMD compares for search loops
aren't always (ever?) a win.  IIRC, glibc strlen and memchr don't use vectors
on ARM / AArch64, just scalar bithacks.)
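
A sketch of that scalar approach (the standard zero-byte bithack, my code
rather than anything from glibc or the bug): __builtin_ctzll becomes BSF/TZCNT
on x86-64 and RBIT+CLZ on AArch64:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical word-at-a-time strlen.  Aligned 8-byte loads can read
     * outside the string but never cross into a new page; assumes
     * little-endian byte order. */
    static size_t strlen_bithack(const char *s)
    {
        const uint64_t ones  = 0x0101010101010101ULL;
        const uint64_t highs = 0x8080808080808080ULL;

        /* Round down so every load is an aligned word, then force the bytes
         * before `s` (if any) to be non-zero so they can't look like a
         * terminator. */
        const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)7);
        uint64_t v;
        memcpy(&v, p, 8);
        v |= (1ULL << (8 * ((uintptr_t)s & 7))) - 1;

        uint64_t zmask = (v - ones) & ~v & highs;
        while (zmask == 0) {                 /* no zero byte in this word */
            p += 8;
            memcpy(&v, p, 8);
            zmask = (v - ones) & ~v & highs;
        }
        /* The lowest set bit marks the first zero byte, so one ctz replaces
         * the CMOV binary search; bit index / 8 = byte index. */
        return (size_t)((p - s) + (__builtin_ctzll(zmask) >> 3));
    }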
