http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726

--- Comment #18 from rguenther at suse dot de <rguenther at suse dot de> 
2012-06-21 08:46:11 UTC ---
On Wed, 20 Jun 2012, hjl.tools at gmail dot com wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53726
> 
> --- Comment #17 from H.J. Lu <hjl.tools at gmail dot com> 2012-06-20 15:36:09 
> UTC ---
> (In reply to comment #16)
> > But I am not sure if a good library implementation shouldn't be always
> > preferable to a byte-wise copy.  We could, at least try to envision a way
> > to retain and use the knowledge that the size is at most 8 when expanding
> > the memcpy (with AVX we could use a masked store for example - quite fancy).
> 
> string/memory functions in libc can be much faster than the ones generated
> by GCC unless the size is very small, PR 43052.

Yes.  The question is what is "very small" and how can we possibly
detect "very small".  For this testcase we can derive an upper bound
of the size, which is 8, but the size is not constant.  I think unless
we know we can expand the variable-size memcpy with, say, three
CPU instructions inline there is no reason to not call memcpy.
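
To make the situation concrete, here is a minimal sketch (not the
testcase from this PR; the function names and the clamp to 8 are
made up for illustration) of a copy whose size is variable but
provably bounded by 8, once as a library call and once as the
byte-wise loop it would be weighed against:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical example: the size n is not a compile-time constant,
   but its value range is known to be [0, 8]. */
static void copy_via_call(unsigned char *dst, const unsigned char *src,
                          size_t len)
{
    size_t n = len < 8 ? len : 8;   /* upper bound of 8, still variable */
    memcpy(dst, src, n);            /* left as a call to the library */
}

/* The same copy expanded byte-wise, one load and one store per byte. */
static void copy_bytewise(unsigned char *dst, const unsigned char *src,
                          size_t len)
{
    size_t n = len < 8 ? len : 8;
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

Both functions have the same effect; the open question is only which
form is cheaper for sizes this small.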

Thus if the CPU could do

  tem = unaligned-load-8-bytes-from-src-and-ignore-faults;
  mask = generate-mask-from-size;
  store-unaligned-8-bytes-with-mask;

then expanding the memcpy call inline would be a win I suppose.
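
Plain C cannot express a fault-ignoring load or a masked store, so the
following is only a rough emulation of that sequence.  It assumes a
little-endian target and, unlike the hypothetical instructions, that
both buffers have 8 accessible bytes; it also turns the store into a
non-atomic read-modify-write.  The names size_to_mask and copy_upto8
are invented for the sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: byte mask selecting the first n bytes in memory
   (the low-order bytes on a little-endian target), for n <= 8. */
static uint64_t size_to_mask(size_t n)
{
    /* Avoid the undefined shift by 64 when n == 8. */
    return n >= 8 ? ~UINT64_C(0) : (UINT64_C(1) << (8 * n)) - 1;
}

/* Copy n <= 8 bytes using one 8-byte load from src, a blend, and one
   8-byte store.  ASSUMPTION: src and dst each have 8 accessible bytes. */
static void copy_upto8(void *dst, const void *src, size_t n)
{
    uint64_t s, d;
    uint64_t mask = size_to_mask(n);

    memcpy(&s, src, 8);             /* fixed-size unaligned load */
    memcpy(&d, dst, 8);             /* old destination bytes */
    d = (d & ~mask) | (s & mask);   /* keep bytes beyond n unchanged */
    memcpy(dst, &d, 8);             /* fixed-size unaligned store */
}
```

The fixed-size memcpy calls here are just the portable way to spell an
unaligned 8-byte access; the point is the mask generation and blend.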
AVX has VMASKMOV, but I'm not sure whether using it for sizes <= 16
bytes is profitable.  Note that from the specs of VMASKMOV it seems
the memory operands need to be aligned and the mask does not support
byte-granularity.
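
To illustrate the granularity point, here is a scalar model (plain C,
not the instruction itself) of a 128-bit masked store with 32-bit
lanes, where a lane is written only if the sign bit of its mask element
is set.  The function names are made up, and the model says nothing
about alignment or fault behaviour.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of a 4 x 32-bit masked store: lane i is written only
   when the most significant bit of mask[i] is set. */
static void maskstore32_model(unsigned char *dst, const int32_t mask[4],
                              const unsigned char *src)
{
    for (int i = 0; i < 4; i++)
        if (mask[i] < 0)
            memcpy(dst + 4 * i, src + 4 * i, 4);
}

/* Copy n <= 16 bytes.  Whole 4-byte lanes go through the masked store;
   the n % 4 trailing bytes cannot be expressed by the lane mask and
   are copied separately.  Returns the number of such tail bytes. */
static size_t copy_upto16_model(unsigned char *dst,
                                const unsigned char *src, size_t n)
{
    int32_t mask[4];
    size_t lanes = n / 4;
    size_t tail = n % 4;

    for (size_t i = 0; i < 4; i++)
        mask[i] = i < lanes ? -1 : 0;
    maskstore32_model(dst, mask, src);

    for (size_t i = 0; i < tail; i++)
        dst[4 * lanes + i] = src[4 * lanes + i];
    return tail;
}
```

For any size that is not a multiple of 4 the masked store alone is not
enough, so the expansion needs extra instructions for the tail.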

That would leave us inline-expanding only memcpy calls of at most 2
bytes.  Of course there is currently no way to record an upper
bound for the size (we do not retain value-range information, but
we of course should).
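
For reference, an inline expansion of the at-most-2-byte case could
look roughly like the sketch below.  The name copy_upto2 is invented,
and the bound n <= 2 is assumed to be known by the caller; nothing
here reflects what GCC actually emits.

```c
#include <stddef.h>

/* Copy n bytes, ASSUMING the caller guarantees n <= 2.  Each byte is
   guarded by a compare, so no load or store touches memory beyond n. */
static void copy_upto2(unsigned char *dst, const unsigned char *src,
                       size_t n)
{
    if (n > 0)
        dst[0] = src[0];
    if (n > 1)
        dst[1] = src[1];
}
```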
