[Bug rtl-optimization/50728] Inefficient vector loads from aggregates passed by value

rth at gcc dot gnu.org Fri, 14 Oct 2011 11:11:29 -0700

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50728


Richard Henderson <rth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011-10-14
                 CC|                            |rth at gcc dot gnu.org
     Ever Confirmed|0                           |1

--- Comment #3 from Richard Henderson <rth at gcc dot gnu.org> 2011-10-14 18:10:07 UTC ---
The problem is that the ABI was designed with the scalar operations
in mind, rather than possible vectorization.  If you consider an
alternate function

A foo(A a, A b)
{
  a.a[0] += b.a[0];
  a.a[1] -= b.a[1];
  a.a[2] *= b.a[2];
  a.a[3] /= b.a[3];
  return a;
}

then the way the ABI passes the floats *is* optimal.  I.e. already
unpacked in the registers, ready for use in their scalar operations.

What you're asking for is a special private ABI for "sum", one that
knows the inputs are used packed, as vectors.

Given that you can achieve the parameter register assignment that
you want via passing the proper vector type, this seems to be a 
simple matter of function cloning/versioning:

  V4SF sum.vector(V4SF a, V4SF b)
  {
    return a + b;
  }

  user_of_sum()
  {
    ...
    V4SF r.v = sum.vector(VIEW_CONVERT<V4SF, a>, VIEW_CONVERT<V4SF, b>);
    A r = VIEW_CONVERT<A, r.v>;
    ...
  }

Of course, I've no idea how you're going to decide when to produce
this particular clone.  That seems like a fairly hard decision to make,
given the relative placements of the vectorization passes and the
IPA passes.
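
For illustration only, the clone above can be written out by hand in C
using the GCC/Clang vector extension. This is a sketch of what the
transformation would produce, not something the compiler does today.
The definition of A (a struct wrapping four floats), the element-wise
body of "sum", and the helper names a_to_v4sf and v4sf_to_a are
assumptions made for the example; memcpy stands in for VIEW_CONVERT.

```c
#include <assert.h>   /* static_assert (C11) */
#include <string.h>   /* memcpy */

/* Assumed layout of the aggregate: four packed floats. */
typedef struct { float a[4]; } A;

/* GCC/Clang vector extension: four floats in one 16-byte vector. */
typedef float V4SF __attribute__((vector_size(16)));

/* The byte-wise reinterpretation below only makes sense if sizes match. */
static_assert(sizeof(A) == sizeof(V4SF), "A and V4SF must have the same size");

/* Hand-written stand-in for the "sum.vector" clone: vectors in, vector out. */
static V4SF sum_vector(V4SF a, V4SF b)
{
    return a + b;
}

/* Stand-ins for VIEW_CONVERT: reinterpret the bytes, no value conversion. */
static V4SF a_to_v4sf(A x)
{
    V4SF v;
    memcpy(&v, &x, sizeof v);
    return v;
}

static A v4sf_to_a(V4SF v)
{
    A x;
    memcpy(&x, &v, sizeof x);
    return x;
}

/* The public entry point keeps the by-value aggregate signature and
   forwards to the vector clone, as user_of_sum does in the sketch above. */
A sum(A a, A b)
{
    return v4sf_to_a(sum_vector(a_to_v4sf(a), a_to_v4sf(b)));
}
```

Written this way, only sum_vector has the vector-friendly parameter
assignment. Whether the wrapper's conversions disappear depends on
inlining, which is the same clone-selection question raised above.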
