https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809
Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca

--- Comment #4 from Peter Cordes <peter at cordes dot ca> ---
Yes, rep scasb is abysmal, and gcc -O3's 4-byte-at-a-time scalar loop is not
very good either.  With 16-byte alignment (which we have from calloc on
x86-64 System V), we can inline a *much* better SSE2 loop.

See https://stackoverflow.com/a/55589634/224132 for more details and
microbenchmarks.  On Skylake it's about 4 to 5x faster than the current
4-byte loop for large strings, and 3x faster for short strings.  For short
strings (strlen=33), it's about 1.5x faster than calling strlen.  For very
large strings (too big for L2 cache), it's ~1.7x slower than glibc's AVX2
strlen.

The lack of VEX encoding for pxor and pmovmskb is just me being lazy; let gcc
emit them all with VEX if AVX is enabled.

    # at this point gcc has `s` in RDX, `i` in ECX
    pxor     %xmm0, %xmm0        # zeroed vector to compare against
    .p2align 4
.Lstrlen16:                      # do {
#ifdef __AVX__
    vpcmpeqb (%rdx), %xmm0, %xmm1
#else
    movdqa   (%rdx), %xmm1
    pcmpeqb  %xmm0, %xmm1        # xmm1 = -1 where there was a 0 in memory
#endif
    add      $16, %rdx           # ptr++
    pmovmskb %xmm1, %eax         # extract high bit of each byte to a 16-bit mask
    test     %eax, %eax
    jz       .Lstrlen16          # }while(mask==0);
    # RDX points at the 16-byte chunk *after* the one containing the terminator
    # EAX = bit-mask of the 0 bytes, and is known to be non-zero
    bsf      %eax, %eax          # EAX = bit-index of the lowest set bit

    # terminator is at rdx+rax - 16
    # movb $'A', -16(%rdx, %rax)    # for a microbench that used s[strlen(s)]='A'
    sub      %rbp, %rdx          # p -= start   (start pointer saved in RBP)
    lea      -16(%rdx, %rax), %rax   # len = (p - start) + byte_within_vector - 16

We should actually use REP BSF (tzcnt), because that's faster on AMD and the
same speed on Intel.

There's also an inline-asm implementation of it, with a microbenchmark
adapted from the SO question: https://godbolt.org/z/9tuVE5
(Compile with -DUSE_ASM -DREAD_ONLY to benchmark a fixed length repeatedly.)
It uses clock() for timing, which I didn't bother updating.  I made it
possible to run for lots of iterations for consistent timing, and so the
real-work portion dominates the runtime and we can use perf stat to measure
it.

----

If we only have 4-byte alignment, maybe check the first 4B, then do
(p+4) & ~7 to either overlap that 4B again or not when we start 8B chunks.

But it's probably good to get to 16-byte alignment and do whole SSE2 vectors,
because repeating an aligned 16-byte test that overlaps an 8-byte test costs
the same as doing another 8-byte test.  (Except on CPUs like Bobcat that
split 128-bit vectors into 64-bit halves.)  The extra AND to round down to an
alignment boundary is all it takes, plus the code-size cost of peeling one
iteration each of 4B and 8B before a 16-byte loop.

We can do the 4B / 8B tests with movd / movq instead of movdqa.  For
pmovmskb, we can ignore the compare-true results for the upper 8 bytes by
testing the result with `test %al,%al` (8-byte case), or with
`test $0x0F, %al` to check only the low 4 bits of EAX for the 4-byte case.

----

The scalar bithack version can use BSF instead of a CMOV binary search to
find the byte with a set high bit.  That should be a win if we ever want to
do scalar on an x86 target (especially with 8-byte registers), or on AArch64.
AArch64 can rbit / clz to emulate bsf and find the position of the first set
bit.
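Roughly like this minimal C sketch (the function name is made up; it assumes
an 8-byte-aligned start and a little-endian target, and as C source it reads
past the end of the object, which is only meant to model what
compiler-generated asm with aligned loads would do):

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical sketch: 8 bytes at a time with the classic
     * (v - 0x01..01) & ~v & 0x80..80 zero-byte bithack, then one bit-scan
     * (bsf/tzcnt on x86, rbit+clz via __builtin_ctzll on AArch64) instead
     * of a CMOV binary search.  Assumes `s` is 8-byte aligned, little-endian. */
    static size_t strlen_bithack8(const char *s)
    {
        const uint64_t ones  = 0x0101010101010101ULL;
        const uint64_t highs = 0x8080808080808080ULL;
        const uint64_t *p = (const uint64_t *)s;
        uint64_t zeros;

        for (;;) {
            uint64_t v = *p;                    /* aligned 8-byte load */
            zeros = (v - ones) & ~v & highs;    /* nonzero iff chunk has a 0 byte */
            if (zeros)
                break;
            p++;
        }
        /* Lowest set bit -> byte index within the chunk.  On little-endian the
         * bithack has no false positives below the first real zero byte, so
         * ctz finds the terminator. */
        return (size_t)((const char *)p - s) + (__builtin_ctzll(zeros) >> 3);
    }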
(Without an efficient SIMD compare-result -> integer bitmask, or efficient SIMD -> integer transfers at all on some ARM / AArch64 chips, SIMD compares for search loops aren't always (ever?) a win.  IIRC, glibc strlen and memchr don't use vectors on ARM / AArch64, just scalar bithacks.)
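For reference, here's roughly what the 16-byte SSE2 loop at the top of this
comment looks like as C with intrinsics (function name made up; assumes a
16-byte-aligned start like we get from calloc, and again the read past the
terminator is only a model of the asm gcc would inline):

    #include <emmintrin.h>   /* SSE2 */
    #include <stddef.h>

    /* Hypothetical sketch with the same structure as the asm above:
     * pcmpeqb against zero, pmovmskb, loop while the mask is 0, then a
     * bit-scan on the mask.  Assumes `s` is 16-byte aligned so the
     * movdqa-style load never crosses into an unmapped page. */
    static size_t strlen_sse2_aligned(const char *s)
    {
        const __m128i zero = _mm_setzero_si128();       /* pxor %xmm0,%xmm0 */
        const char *p = s;
        unsigned mask;

        for (;;) {
            __m128i v = _mm_load_si128((const __m128i *)p);        /* movdqa */
            mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
            if (mask)                    /* pcmpeqb + pmovmskb + test/jz */
                break;
            p += 16;
        }
        /* __builtin_ctz compiles to bsf / rep bsf (tzcnt): index of the
         * first zero byte within the chunk. */
        return (size_t)(p - s) + __builtin_ctz(mask);
    }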