[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops

mjguzik at gmail dot com via Gcc-bugs Fri, 04 Apr 2025 16:30:16 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596


--- Comment #19 from Mateusz Guzik <mjguzik at gmail dot com> ---
The results in PR 95435 look suspicious to me, so I took a closer look at the
bench script and I'm confident it is bogus.

The compiler emits ops sized 0..2 * n - 1, where n is the reported block size.

For example memcpy of block size == 4 gives me this:
LD_PRELOAD=$PWD/badops.so ./_a.out_4_-mstringop-strategy=libcall
[snip]
0x55a3fa8922bc -> 0x55a3fa0922bc 0
0x55a3fa9922bd -> 0x55a3fa4922bd 1
0x55a3faa922be -> 0x55a3fa8922be 2
0x55a3fab922bf -> 0x55a3f9c922bf 3
0x55a3f9c922c0 -> 0x55a3fa0922c0 4
0x55a3f9d922c1 -> 0x55a3fa4922c1 5
0x55a3f9e922c2 -> 0x55a3fa8922c2 6
0x55a3f9f922c3 -> 0x55a3f9c922c3 7
0x55a3fa0922c4 -> 0x55a3fa0922c4 0
0x55a3fa1922c5 -> 0x55a3fa4922c5 1
0x55a3fa2922c6 -> 0x55a3fa8922c6 2
0x55a3fa3922c7 -> 0x55a3f9c922c7 3
0x55a3fa4922c8 -> 0x55a3fa0922c8 4
0x55a3fa5922c9 -> 0x55a3fa4922c9 5
0x55a3fa6922ca -> 0x55a3fa8922ca 6
0x55a3fa7922cb -> 0x55a3f9c922cb 7
[snip]

Same thing for bigger sizes.

Even if the goal was to iterate on sizes up to the block size, the upper limit
is twice the reported one.
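
The pattern in the log can be reproduced with a minimal sketch. This is not the
bench script from PR 95435: op_size and max_op_size are hypothetical helpers,
and the modulo rule is an assumption inferred only from the sizes logged above
(0..7 repeating for a reported block size of 4).

```c
#include <stddef.h>

/* Hypothetical reconstruction of the observed size sequence; the real
 * bench script may compute its sizes differently. */
static size_t op_size(size_t iteration, size_t block_size)
{
    return iteration % (2 * block_size);
}

/* Largest size that ever shows up for a given reported block size. */
static size_t max_op_size(size_t block_size)
{
    return 2 * block_size - 1;
}
```

Under that assumption, a reported block size of 4 yields sizes 0, 1, ..., 7 and
then wraps around, so the largest op is 7 bytes and no single size dominates.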

Was the intent to bench what to do if the size is not known at compilation
time, but does have a known upper bound?

I am not going to make any comments on what to do in that case.

Perhaps this is once more a case of mismatched assumptions, so I'm also going
to add that the case I'm arguing for deals with sizes *known* at compilation
time. Thus whatever asm snippet is emitted will *always* get the same size.
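
To make the distinction concrete, here is a minimal sketch of the two cases.
The struct and function names are hypothetical, and the 192-byte size is picked
only because it is above 128 bytes; none of it comes from the bug report.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct blob {
    unsigned char bytes[192];
};

/* Size known at compilation time: every execution of the inlined
 * sequence clears exactly sizeof(struct blob) bytes. */
void clear_fixed(struct blob *b)
{
    memset(b, 0, sizeof(*b));
}

/* Size not known at compilation time, but with a known upper bound:
 * the same call site may see any n from 0 to sizeof(struct blob). */
void clear_bounded(struct blob *b, size_t n)
{
    assert(n <= sizeof(*b));
    memset(b, 0, n);
}
```

Only the clear_fixed shape is the case argued for here; clear_bounded is the
shape the bench script appears to exercise.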

See my Intel and AMD results above from a non-microbenchmark. A few hotspots
suffering from > 128 byte memset or memcpy are exercised there and experience a
speedup from using unrolled stores, 32 bytes per iteration.
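
As a rough illustration of what "unrolled, 32 bytes per iteration" means, here
is a C sketch of a zeroing loop. It is an assumption about the general shape
only: zero_unrolled32 is a hypothetical name, the sequences discussed in this
bug are emitted by the compiler as asm, and an optimizing compiler is free to
turn this C loop into something else entirely.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void zero_unrolled32(unsigned char *dst, size_t len)
{
    const uint64_t zero = 0;

    /* Main loop: four 8-byte stores, i.e. 32 bytes per iteration. */
    while (len >= 32) {
        memcpy(dst, &zero, 8);
        memcpy(dst + 8, &zero, 8);
        memcpy(dst + 16, &zero, 8);
        memcpy(dst + 24, &zero, 8);
        dst += 32;
        len -= 32;
    }

    /* Tail: whatever is left, one byte at a time. */
    while (len > 0) {
        *dst++ = 0;
        len--;
    }
}
```

With a size known at compilation time, the trip count and the tail are fixed,
so the tail loop above would collapse into a constant handful of stores.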

[Bug target/119596] x86: too eager use of rep movsq/rep stosq for inlined ops
