> On Sun, Apr 20, 2025 at 4:19 AM Jan Hubicka <hubi...@ucw.cz> wrote:
> >
> > > On Tue, Apr 8, 2025 at 3:52 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> > > >
> > > > Simplify memcpy and memset inline strategies to avoid branches for
> > > > -mtune=generic:
> > > >
> > > > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
> > > >    load and store for up to 16 * 16 (256) bytes when the data size is
> > > >    fixed and known.
> >
> > Originally we set CLEAR_RATIO smaller than MOVE_RATIO because to store
> > zeros we use:
> >
> >    0:   48 c7 07 00 00 00 00    movq   $0x0,(%rdi)
> >    7:   48 c7 47 08 00 00 00    movq   $0x0,0x8(%rdi)
> >    e:   00
> >    f:   48 c7 47 10 00 00 00    movq   $0x0,0x10(%rdi)
> >   16:   00
> >   17:   48 c7 47 18 00 00 00    movq   $0x0,0x18(%rdi)
> >   1e:   00
> >
> > so about 8 bytes per instruction.   We could optimize it by loading 0
> 
> This is orthogonal to this patch.

True, I mentioned it mostly because ...
> 
> > into a scratch register, but we don't.  The SSE variant is shorter:
> >
> >    4:   0f 11 07                movups %xmm0,(%rdi)
> >    7:   0f 11 47 10             movups %xmm0,0x10(%rdi)
> >
> > So I wonder if we care about code size with -mno-sse (i.e. for building
> > the kernel).
> 
> This patch doesn't change -Os behavior, which uses x86_size_cost,
> not generic_cost.

... we need to make code size/speed tradeoffs even at -O2 (and partly at
-O3).  A sequence of 17 integer moves will be 136 bytes of code, while
with SSE it will be 68 bytes.  It would be nice to understand how often
it pays back compared to shorter sequences.
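
To put numbers on that, here is a small testcase of my own (not part of
the patch; the buffer and function names are made up) that can be built
once with -O2 -mtune=generic and once with -O2 -mtune=generic -mno-sse,
and the resulting code sizes compared with objdump -d:

#include <string.h>

char buf[256];

void
clear_buf (void)
{
        /* Fixed, known size.  With the proposed CLEAR_RATIO of 17 this
           should be expanded inline; without SSE the expansion has to
           use 8-byte stores of immediate zero, while with SSE it can
           use 16-byte movups stores, so it should be much shorter.  */
        memset (buf, 0, 128);
}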

> > SPEC is not very sensitive to string op implementation.  I wonder if you
> > have specific testcases where using the loop variant for very small blocks
> > is a loss?
> 
> For small blocks with known sizes, a loop is slower because of branches.

For a known size we should use a sequence of moves (up to the MOVE/COPY
ratio).  Even with the current setting of 6 we should be able to copy all
blocks of size <32.  To copy a block of 32 bytes we need 4 64-bit moves
or 2 128-bit moves.

#include <string.h>
char *src;
char *dest;
char *dest2;
void
test ()
{
        memcpy (dest, src, 31);
}
void
test2 ()
{
        memset (dest2, 0, 31);
}

compiles to
test:
        movq    src(%rip), %rdx
        movq    dest(%rip), %rax
        movdqu  (%rdx), %xmm0
        movups  %xmm0, (%rax)
        movdqu  15(%rdx), %xmm0
        movups  %xmm0, 15(%rax)
        ret
test2:
        movq    dest2(%rip), %rax
        pxor    %xmm0, %xmm0
        movups  %xmm0, (%rax)
        movups  %xmm0, 15(%rax)
        ret

The copy algorithm tables are mostly used when the block size is greater
than what we can copy/set within COPY/CLEAR_RATIO, or when we know the
expected size from profile feedback.  In relatively rare cases when value
range analysis delivers a useful range we use it too (some work would be
needed to make this work better), but it works on simple testcases.  For
example:

#include <string.h>
void
test (char *dest, int n)
{
        memset (dest, 0, 30 + (n != 0));
}

compiles to a loop instead of a library call since we know that the code
will set 30 or 31 bytes.

This testcase should be compiled the same way:
#include <string.h>
char dest[31];
void
test (int n)
{
        memset (dest, 0, n);
}

since the upper bound on the block size is 31 bytes, but we fail to
detect that.

Honza
> 
> > We are also better on picking codegen choice with PGO since we
> > value-profile size of the block.
> >
> > Inlining memcpy is bigger win in situation where it prevents spilling
> > data from caller saved registers.  This makes it a bit hard to guess how
> > microbenchmarks relate to more real-world situations where the
> > surrounding code may need to hold data in SSE regs etc.
> > If we had a special entry-point to memcpy/memset that does not clobber
> > registers and does its own callee save, this problem would go away...
> >
> > Honza
> 
> 
> 
> -- 
> H.J.
