> > Do you know which of the three changes (preferring reps/stosb,
> > CLEAR_RATIO and algorithm choice changes) causes the two speedups
> > on eembc?
> 
> An extracted testcase from nnet_test is at https://godbolt.org/z/c8KdsohTP
> 
> This loop is transformed to builtin_memcpy and builtin_memset with size 280.
> 
> The current strategy for skylake is {512, unrolled_loop, false} for such
> sizes, so it generates unrolled loops with mov, while the patch
> generates a memcpy/memset libcall and uses vector moves.
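
(For reference, a guess at the shape of the loop being described; this is
my own reduced sketch, not the actual nnet_test source.  With -O2/-O3 the
loop-distribution pass can recognize loops like these as memcpy/memset
idioms of size 280.)

  /* Hypothetical reduction: 70 int elements = 280 bytes.  */
  #define N 70

  void
  init (int *dst, const int *src, int *acc)
  {
    for (int i = 0; i < N; i++)   /* can become __builtin_memcpy (dst, src, 280) */
      dst[i] = src[i];

    for (int i = 0; i < N; i++)   /* can become __builtin_memset (acc, 0, 280) */
      acc[i] = 0;
  }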

This is good - I originally set the table based on this
micro-benchmarking script, and apparently the glibc used at that time had
a more expensive memcpy for small blocks.

One thing to consider, however, is that calling an external memcpy also
has the additional cost of clobbering all caller-saved registers.
Especially for code that uses SSE this is painful, since everything needs
to go to the stack in that case.  So I am not completely sure how
representative the micro-benchmark is in this respect, since it does not
use any SSE and register pressure is generally small.
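
(To make the register-clobbering point concrete, a made-up example, not
from the benchmark: in the x86-64 SysV ABI every xmm register is
caller-saved, so SSE values that are live across an out-of-line memcpy
call have to be spilled, while an inline expansion leaves the register
allocator room to keep them in registers.)

  #include <emmintrin.h>
  #include <string.h>

  /* v0 and v1 are live across the copy.  If the copy stays a call to
     memcpy, both vectors must be saved to the stack around the call;
     if it is expanded inline they can stay in registers.  */
  __m128i
  blend_and_copy (char *dst, const char *src, __m128i v0, __m128i v1)
  {
    memcpy (dst, src, 200);
    return _mm_add_epi32 (v0, v1);
  }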

So with the current glibc it seems a libcall is a win for blocks of size
greater than 64 or 128, at least if the register pressure is not big.
In this respect your change looks good.
> >
> > My patch generates "rep movsb" only in a very limited cases:
> >
> > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
> >    load and store for up to 16 * 16 (256) bytes when the data size is
> >    fixed and known.
> > 2. Inline only if data size is known to be <= 256.
> >    a. Use "rep movsb/stosb" with a simple code sequence if the data size
> >       is a constant.
> >    b. Use loop if data size is not a constant.
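
(Reading 2a/2b above, a minimal pair that should exercise the two paths;
hypothetical functions, not from the patch or its testsuite:)

  /* Constant size <= 256: expanded inline, with integer/vector moves
     (MOVE_RATIO/CLEAR_RATIO == 17) or, with -mno-sse, the short
     "rep stosb" sequence.  */
  void
  clear_const (char *p)
  {
    __builtin_memset (p, 0, 200);
  }

  /* Size bounded but not a compile-time constant: per 2b above a loop
     is expanded instead of "rep stosb".  */
  void
  clear_var (char *p, unsigned long n)
  {
    if (n <= 256)
      __builtin_memset (p, 0, n);
  }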

Aha, this is very hard to read from the algorithm descriptor.  So we
still have the check that maxsize==minsize and use rep movsb only for
constant-sized blocks when the corresponding TARGET macro is defined.

I think it would be more readable if we introduced rep_1_byte_constant.
The descriptor is supposed to read as a sequence of rules where the first
one that applies wins.  It is not obvious that we have another TARGET_*
macro that makes rep_1_byte be ignored in some cases.
(The TARGET macro will also interfere with the micro-benchmarking script.)
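
(A sketch of what I mean, spelled the way the existing tables are; the
rep_1_byte_constant enumerator is hypothetical, it does not exist today,
and the real enumerators in i386.h use longer names such as
rep_prefix_1_byte:)

  /* Entries are tried in order; the first one whose max covers the block
     size wins.  A rep_1_byte_constant entry would only match when the
     size is a compile-time constant, so the current implicit TARGET_*
     check becomes visible in the table itself.  */
  {libcall,
   {{256, rep_1_byte_constant, true},   /* constant size up to 256 */
    {256, unrolled_loop, false},        /* otherwise, up to 256 */
    {-1, libcall, false}}}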

Still I do not understand why a compile-time constant makes rep movsb/stosb
better than a loop.  Is it the CPU special-casing it at decode time and
requiring an explicit mov instruction?  Or is it only because rep movsb is
not good for blocks smaller than 128 bits?

> >
> > As a result,  "rep stosb" is generated only when 128 < data size < 256
> > with -mno-sse.
> >
> > > Do you have some data showing blocks of size 8...256 to be faster with rep1
> > > compared to the unrolled loop, perhaps for more real-world benchmarks?
> >
> > "rep movsb" isn't generated with my patch in this case since
> > MOVE_RATIO == 17 can copy up to 16 * 16 (256) bytes with
> > XMM registers.

OK, so I guess:
  {libcall,
   {{256, rep_1_byte, true},
    {256, unrolled_loop, false},
    {-1, libcall, false}}},
  {libcall,
   {{256, rep_1_loop, true},
    {256, unrolled_loop, false},
    {-1, libcall, false}}}};

may still perform better, but the difference between loop and unrolled
loop is within a 10% margin.
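
(For reference, the descriptor structure in i386.h looks roughly like the
following; field names quoted from memory, so double-check them.
unknown_size is the algorithm used when the block size is not known at
expansion time, each {max, alg, noalign} entry covers blocks up to max
bytes, and max == -1 means no upper bound.)

  struct stringop_algs
  {
    const enum stringop_alg unknown_size;
    const struct stringop_strategy
    {
      const int max;                 /* largest block size handled */
      const enum stringop_alg alg;   /* algorithm to use */
      int noalign;                   /* skip aligning the destination */
    } size[MAX_STRINGOP_ALGS];
  };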

So I guess the patch is OK and we should look into cleaning up the
descriptors.  I can make a patch for that once I understand the logic above.

Honza
> >
> > > The difference seems to get quite big for small blocks in the range 8...16
> > > bytes.  I noticed that before and sort of concluded that it is probably
> > > the branch prediction playing relatively well for those small block
> > > sizes.  On the other hand, winding up the relatively long unrolled loop is
> > > not very cool just to catch this case.
> > >
> > > Do you know which of the three changes (preferring reps/stosb,
> > > CLEAR_RATIO and algorithm choice changes) causes the two speedups
> > > on eembc?
> >
> > Hongyu, can you find out where the speedup came from?
> >
> > Thanks.
> >
> > --
> > H.J.
