> On Sun, Apr 20, 2025 at 4:19 AM Jan Hubicka <hubi...@ucw.cz> wrote:
> >
> > > On Tue, Apr 8, 2025 at 3:52 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> > > >
> > > > Simplify memcpy and memset inline strategies to avoid branches for
> > > > -mtune=generic:
> > > >
> > > > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
> > > > load and store for up to 16 * 16 (256) bytes when the data size is
> > > > fixed and known.
> >
> > Originally we set CLEAR_RATIO smaller than MOVE_RATIO because to store
> > zeros we use:
> >
> >    0:   48 c7 07 00 00 00 00    movq   $0x0,(%rdi)
> >    7:   48 c7 47 08 00 00 00    movq   $0x0,0x8(%rdi)
> >    e:   00
> >    f:   48 c7 47 10 00 00 00    movq   $0x0,0x10(%rdi)
> >   16:   00
> >   17:   48 c7 47 18 00 00 00    movq   $0x0,0x18(%rdi)
> >   1e:   00
> >
> > so about 8 bytes per instruction.  We could optimize it by loading 0
>
> This is orthogonal to this patch.
True, I mentioned it mostly because ...

> > to scratch register but we don't.  SSE variant is shorter:
> >
> >    4:   0f 11 07                movups %xmm0,(%rdi)
> >    7:   0f 11 47 10             movups %xmm0,0x10(%rdi)
> >
> > So I wonder if we care about code size with -mno-sse (i.e. for
> > building kernel).
>
> This patch doesn't change -Os behavior which uses x86_size_cost,
> not generic_cost.

... we need to make code size/speed tradeoffs even at -O2 (and partly
-O3).  A sequence of 17 integer moves will be 136 bytes of code, while
with SSE it will be 68 bytes.  It would be nice to understand how often
it pays back compared to shorter sequences.

> > SPEC is not very sensitive to string op implementation.  I wonder if
> > you have specific testcases where using loop variant for very small
> > blocks is a loss?
>
> For small blocks with known sizes, loop is slower because of branches.

For known sizes we should use a sequence of moves (up to the MOVE/COPY
ratio).  Even with the current setting of 6 we should be able to copy
all blocks of size < 32.  To copy a block of 32 bytes we need 4 64-bit
moves or 2 128-bit moves.

#include <string.h>
char *src;
char *dest;
char *dest2;

void
test ()
{
  memcpy (dest, src, 31);
}

void
test2 ()
{
  memset (dest2, 0, 31);
}

compiles to

test:
        movq    src(%rip), %rdx
        movq    dest(%rip), %rax
        movdqu  (%rdx), %xmm0
        movups  %xmm0, (%rax)
        movdqu  15(%rdx), %xmm0
        movups  %xmm0, 15(%rax)
        ret

test2:
        movq    dest2(%rip), %rax
        pxor    %xmm0, %xmm0
        movups  %xmm0, (%rax)
        movups  %xmm0, 15(%rax)
        ret

The copy algorithm tables are mostly used when the block size is greater
than what we can copy/set by COPY/CLEAR_RATIO, or when we know the
expected size from profile feedback.  In relatively rare cases, when
value ranges deliver a useful range, we use it too.  (Some work would be
needed to make this work better.)  But it works on simple testcases.
For example:

#include <string.h>
void
test (char *dest, int n)
{
  memset (dest, 0, 30 + (n != 0));
}

compiles to a loop instead of a library call, since we know that the
code will copy 30 or 31 bytes.  This testcase should be compiled the
same way:

#include <string.h>
char dest[31];
void
test (int n)
{
  memset (dest, 0, n);
}

since the upper bound on the block size is 31 bytes, but we fail to
detect that.

Honza

> > We are also better on picking codegen choice with PGO since we
> > value-profile size of the block.
> >
> > Inlining memcpy is bigger win in situation where it prevents spilling
> > data from caller saved registers.  This makes it a bit hard to guess
> > how microbenchmarks relate to more real-world situations where the
> > surrounding code may need to hold data in SSE regs etc.
> > If we had a special entry-point to memcpy/memset that does not
> > clobber registers and does its own callee save, this problem would go
> > away...
> >
> > Honza
>
> --
> H.J.