http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50728
Richard Henderson <rth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011-10-14
                 CC|                            |rth at gcc dot gnu.org
     Ever Confirmed|0                           |1

--- Comment #3 from Richard Henderson <rth at gcc dot gnu.org> 2011-10-14 18:10:07 UTC ---
The problem is that the ABI was designed with the scalar operations in mind,
rather than possible vectorization.

If you consider an alternate function

  A foo(A a, A b)
  {
    a.a[0] += b.a[0];
    a.a[1] -= b.a[1];
    a.a[2] *= b.a[2];
    a.a[3] /= b.a[3];
    return a;
  }

then the way the ABI passes the floats *is* optimal.  I.e. already unpacked
in the registers, ready for use in their scalar operations.

What you're asking for is a special private ABI for "sum", with the knowledge
that the inputs are used, packed in their vectors.  Given that you can achieve
the parameter register assignment that you want via passing the proper vector
type, this seems to be a simple matter of function cloning/versioning:

  V4SF sum.vector(V4SF a, V4SF b)
  {
    return a + b;
  }

  user_of_sum()
  {
    ...
    V4SF r.v = sum.vector(VIEW_CONVERT<V4SF, a>, VIEW_CONVERT<V4SF, b>);
    A r = VIEW_CONVERT<A, r.v>;
    ...
  }

Of course, I've no idea how you're going to decide when to produce this
particular clone.  That seems like a fairly hard decision to make, given
the relative placements of the vectorization passes and the IPA passes.
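
For illustration only, the cloning idea described above can be written out by hand in C with the GCC/Clang vector_size extension. This is a rough sketch, not what the compiler generates: the definition of A (a struct wrapping four floats) is assumed from the a.a[i] accesses in the comment, sum_vector is a hypothetical stand-in for the "sum.vector" clone, and memcpy plays the role of VIEW_CONVERT.

```c
#include <assert.h>  /* static_assert */
#include <string.h>  /* memcpy */

/* Assumed shape of the aggregate from the report: four floats in a struct. */
typedef struct { float a[4]; } A;

/* GCC/Clang vector extension: four packed floats (the V4SF type above). */
typedef float v4sf __attribute__((vector_size(16)));

/* The bit-for-bit reinterpretation below is only valid if the sizes match. */
static_assert(sizeof(A) == sizeof(v4sf), "A and v4sf must have the same size");

/* Hypothetical hand-written stand-in for the "sum.vector" clone:
 * both operands arrive as vector-typed arguments and are added lane-wise. */
static v4sf sum_vector(v4sf a, v4sf b)
{
    return a + b;
}

/* Wrapper with the original struct-based signature.  The memcpy calls
 * act as the VIEW_CONVERT steps: they reinterpret the bytes without
 * changing them. */
A sum(A a, A b)
{
    v4sf va, vb, vr;
    A r;

    memcpy(&va, &a, sizeof va);
    memcpy(&vb, &b, sizeof vb);
    vr = sum_vector(va, vb);
    memcpy(&r, &vr, sizeof r);
    return r;
}
```

In this hand-written form the wrapper still receives its arguments under the normal struct-passing rules, so only callers that invoke sum_vector directly with vector-typed values get the vector-friendly parameter passing. Deciding automatically when such a clone is worthwhile remains the open question raised in the comment.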