On Thu, Dec 13, 2012 at 12:26 PM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> On Wed, Dec 12, 2012 at 10:21 PM, Jakub Jelinek <ja...@redhat.com> wrote:
>> > On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
>> >> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> >> >> > libcall is not faster up to 8KB to rep sequence that is better for
>> >> >> > regalloc/code cache than fully blown function call.
>> >> >>
>> >> >> Be careful with this. My recollection is that REP sequence is good for
>> >> >> any size -- for smaller size, the REP initial set up cost is too high
>> >> >> (10s of cycles), while for large size copy, it is less efficient
>> >> >> compared with library version.
>> >> >
>> >> > Well this is based on the data from the memtest script.
>> >> > Core has good REP implementation - it is a win from rather small blocks
>> >> > (16 bytes if I recall) and it does not need alignment.
>> >> > Library version starts to be interesting with caching hints, but I
>> >> > think till 80KB it is still not a win for my setup (glibc-2.15)
>> >>
>> >> A simple test shows that -mstringop-strategy=libcall always beats
>> >> -mstringop-strategy=rep_8byte (on core2 and corei7) except for size
>> >> smaller than 8 where the rep_8byte strategy simply bypasses REP movs.
>> >> Can you share your memtest?
>> >
>> > I can't believe that say 16 byte or 32 byte memcpy can be ever faster
>> > using a libcall. The PLT call overhead is simply too high.
>> >
>>
>> The x86 string/memory functions in the current glibc are
>> extremely fast and tuned for Core 2/Core i7. GCC is having
>> a very hard time to beat them with inlining:
>>
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
>
> Here we speak about memcpy/memset only. I never got around to modernize
> strlen and friends, unfortunately...
>
> memcmp and friends are different beasts. They really need some TLC...

memcpy and memset in glibc are also extremely fast.

-- 
H.J.