http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
--- Comment #18 from rguenther at suse dot de <rguenther at suse dot de> 2012-06-21 08:46:11 UTC ---
On Wed, 20 Jun 2012, hjl.tools at gmail dot com wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
>
> --- Comment #17 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-20 15:36:09 UTC ---
> (In reply to comment #16)
> > But I am not sure if a good library implementation shouldn't be always
> > preferable to a byte-wise copy. We could, at least try to envision a way
> > to retain and use the knowledge that the size is at most 8 when expanding
> > the memcpy (with AVX we could use a masked store for example - quite fancy).
>
> string/memory functions in libc can be much faster than the ones generated
> by GCC unless the size is very small, PR 43052.

Yes. The question is what is "very small" and how can we possibly
detect "very small". For this testcase we can derive an upper bound
of the size, which is 8, but the size is not constant. I think unless
we know we can expand the variable-size memcpy with, say, three
CPU instructions inline there is no reason to not call memcpy.

Thus if the CPU could do

  tem = unaligned-load-8-bytes-from-src-and-ignore-faults;
  mask = generate mask from size
  store-unaligned-8-bytes-with-mask

then expanding the memcpy call inline would be a win I suppose.
AVX has VMASKMOV, but I'm not sure using that for sizes <= 16
bytes is profitable? Note that from the specs of VMASKMOV it seems
the memory operands need to be aligned and the mask does not support
byte-granularity.

Which would leave us to inline expanding the case of at most 2 byte
memcpy. Of course currently there is no way to record an upper
bound for the size (we do not retain value-range information - but
we of course should).
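
The three-step sequence above (load 8 bytes, build a mask from the size,
masked store) can be approximated in portable C. This is only a rough
sketch under strong assumptions: plain C has neither a fault-ignoring load
nor a masked store, so the code assumes that 8 bytes are readable at src
and both readable and writable at dst, and it emulates the masked store
with a read-blend-write. The names copy_upto8 and mask_tbl are hypothetical
and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 8 bytes of 0xff followed by 8 bytes of 0x00.  Reading 8 bytes starting
   at offset (8 - n) yields a memory image whose first n bytes are 0xff,
   independent of endianness.  (Hypothetical helper table.) */
static const unsigned char mask_tbl[16] = {
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
    0,    0,    0,    0,    0,    0,    0,    0
};

/* Copy n <= 8 bytes from src to dst without a loop or a size-dependent branch.
   ASSUMPTION: at least 8 bytes are accessible at both src and dst, which
   a real memcpy cannot assume.  (Hypothetical function.) */
static void copy_upto8(void *dst, const void *src, size_t n)
{
    uint64_t s, d, mask;

    assert(n <= 8);

    /* Fixed-size memcpy calls are typically expanded to single
       unaligned loads/stores by the compiler. */
    memcpy(&s, src, 8);                  /* tem = unaligned 8-byte load   */
    memcpy(&d, dst, 8);                  /* old destination bytes         */
    memcpy(&mask, mask_tbl + 8 - n, 8);  /* mask = generate mask from n   */

    d = (d & ~mask) | (s & mask);        /* keep dst bytes past n         */
    memcpy(dst, &d, 8);                  /* emulated masked 8-byte store  */
}
```

Note that rewriting the destination bytes past n is not equivalent to a
true masked store if another thread touches those bytes, so this only
illustrates the instruction count being discussed, not a drop-in expansion.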