https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
--- Comment #19 from Mateusz Guzik <mjguzik at gmail dot com> ---
The results in PR 95435 look suspicious to me, so I had a closer look at the bench script and I'm confident it is bogus.

The compiler emits ops with sizes in the range 0..2 * n - 1, where n is the reported block size.

For example, memcpy with block size == 4 gives me this:

LD_PRELOAD=$PWD/badops.so ./_a.out_4_-mstringop-strategy=libcall
[snip]
0x55a3fa8922bc -> 0x55a3fa0922bc 0
0x55a3fa9922bd -> 0x55a3fa4922bd 1
0x55a3faa922be -> 0x55a3fa8922be 2
0x55a3fab922bf -> 0x55a3f9c922bf 3
0x55a3f9c922c0 -> 0x55a3fa0922c0 4
0x55a3f9d922c1 -> 0x55a3fa4922c1 5
0x55a3f9e922c2 -> 0x55a3fa8922c2 6
0x55a3f9f922c3 -> 0x55a3f9c922c3 7
0x55a3fa0922c4 -> 0x55a3fa0922c4 0
0x55a3fa1922c5 -> 0x55a3fa4922c5 1
0x55a3fa2922c6 -> 0x55a3fa8922c6 2
0x55a3fa3922c7 -> 0x55a3f9c922c7 3
0x55a3fa4922c8 -> 0x55a3fa0922c8 4
0x55a3fa5922c9 -> 0x55a3fa4922c9 5
0x55a3fa6922ca -> 0x55a3fa8922ca 6
0x55a3fa7922cb -> 0x55a3f9c922cb 7
[snip]

Same thing for bigger sizes.

Even if the goal was to iterate over sizes up to the block size, the upper limit is twice the reported one.

Was the intent to bench what to do if the size is not known at compilation time, but does have a known upper bound? I am not going to make any comments on what to do in that case.

Perhaps this is once more a case of mismatched assumptions, so I'm also going to add that the case I'm arguing for deals with sizes *known* at compilation time. Thus any given asm snippet will *always* get the same size.

See my Intel and AMD results above from a non-microbenchmark. A few hotspots suffering from > 128 byte memset or memcpy are exercised there, and they experience a speedup from using unrolled 32-byte-per-iteration stores.
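
To make the distinction concrete, here is a minimal C sketch of the two cases. It is not taken from the bench script: BLOCK, copy_fixed, copy_bounded and bench_len are hypothetical names, and the modulo in bench_len is only an assumption that happens to reproduce the 0..7 size cycle seen in the log for block size 4.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the "reported block size" n. */
enum { BLOCK = 4 };

/* The case argued for in this comment: the size is a compile-time
 * constant, so the expanded code always handles exactly BLOCK bytes. */
void copy_fixed(char *dst, const char *src)
{
    memcpy(dst, src, BLOCK);
}

/* What the log suggests is being measured instead: the size is only
 * known at run time, with an upper bound of 2 * BLOCK - 1. */
void copy_bounded(char *dst, const char *src, size_t len)
{
    assert(len < (size_t)(2 * BLOCK));
    memcpy(dst, src, len);
}

/* Assumed size sequence: cycles through 0..2 * BLOCK - 1, matching
 * the pattern in the log. The real script may compute it differently. */
size_t bench_len(size_t i)
{
    return i % (size_t)(2 * BLOCK);
}
```

With a constant size the compiler can pick one expansion and nothing varies between calls, whereas copy_bounded has to cope with every length from 0 to 7, so timing it says little about the fixed-size case.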