Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs

Jan Hubicka Thu, 13 Dec 2012 12:26:16 -0800

> On Wed, Dec 12, 2012 at 10:21 PM, Jakub Jelinek <ja...@redhat.com> wrote:
> > On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
> >> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka <hubi...@ucw.cz> wrote:
> >> >> > libcall is not faster up to 8KB to rep sequence that is better for
> >> >> > regalloc/code cache than fully blown function call.
> >> >>
> >> >> Be careful with this. My recollection is that REP sequence is good for
> >> >> any size -- for smaller size, the REP initial set up cost is too high
> >> >> (10s of cycles), while for large size copy, it is less efficient
> >> >> compared with library version.
> >> >
> >> > Well this is based on the data from the memtest script.
> >> > Core has good REP implementation - it is a win from rather small blocks
> >> > (16 bytes if I recall) and it does not need alignment.
> >> > Library version starts to be interesting with caching hints, but I think
> >> > till 80KB it is still not a win for my setup (glibc-2.15)
> >>
> >> A simple test shows that -mstringop-strategy=libcall always beats
> >> -mstringop-strategy=rep_8byte (on core2 and corei7) except for size
> >> smaller than 8 where the rep_8byte strategy simply bypasses REP movs.
> >> Can you share your memtest ?
> >
> > I can't believe that say 16 byte or 32 byte memcpy can be ever faster using
> > a libcall.  The PLT call overhead is simply too high.
> >
> 
> The x86 string/memory functions in the current glibc are
> extremely fast and tuned for Core 2/Core i7.  GCC is having
> a very hard time to beat them with inlining:
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052


Here we speak about memcpy/memset only.  I never got around to modernizing
strlen and friends, unfortunately...
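
For anyone who wants to reproduce the memcpy comparison discussed in this
thread, here is a minimal sketch of a timing harness.  It is not the memtest
script mentioned above.  The helper name time_memcpy, the BENCH_MAIN guard,
the size list and the iteration counts are all assumptions made for
illustration.  The idea is to build the same file twice, once with
-mstringop-strategy=libcall and once with -mstringop-strategy=rep_8byte, and
compare the numbers.  The compiler barrier uses GCC/Clang inline asm syntax.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: average nanoseconds per memcpy of `size` bytes.
   `size` is a runtime value, so the compiler cannot specialize the copy
   for a known constant length. */
static double time_memcpy(char *dst, const char *src, size_t size, long iters)
{
    struct timespec t0, t1;

    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        memcpy(dst, src, size);
        /* Compiler barrier so the copy is not hoisted out of the loop
           or dropped as dead code. */
        __asm__ __volatile__("" : : "r"(dst) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}

#ifdef BENCH_MAIN
int main(void)
{
    /* Sizes chosen to cover the ranges argued about in the thread
       (assumed list, adjust as needed). */
    static const size_t sizes[] = { 8, 16, 32, 64, 256, 1024, 8192, 81920 };
    enum { MAX_SIZE = 81920 };
    char *src = malloc(MAX_SIZE);
    char *dst = malloc(MAX_SIZE);

    if (!src || !dst)
        return 1;
    memset(src, 0x5a, MAX_SIZE);

    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
        /* Fewer iterations for larger blocks to keep run time bounded. */
        long iters = 100000000L / (long)(sizes[i] + 64);
        printf("%6zu bytes: %.2f ns/copy\n",
               sizes[i], time_memcpy(dst, src, sizes[i], iters));
    }

    free(src);
    free(dst);
    return 0;
}
#endif
```

Something like "gcc -O2 -DBENCH_MAIN -mstringop-strategy=rep_8byte bench.c"
versus the same line with "libcall" should give the two variants (bench.c is
a placeholder file name).  The buffers stay hot in cache here, so this only
measures the warm case, and the results will depend on the CPU and the glibc
version in use.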

memcmp and friends are different beasts.  They really need some TLC...

Honza

Re: [PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
