https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64716

--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Say on:
 a2->qyz -= (k2+ka2+kb2)*yt*zt;
 a1->qyz -= (k2+ka2+kb2)*yt*zt;
 a2->qzz -= k2*(zt2 - 1./3) + ka2*(zt2 - 1./8)+kb2*(zt2-1./14) ;
 a1->qzz -= k2*(zt2 - 1./3) + ka2*(zt2 - 1./8)+kb2*(zt2-1./14) ;
it seems that
temp1 = (k2+ka2+kb2)*yt*zt
and
temp2 = k2*(zt2 - 1./3) + ka2*(zt2 - 1./8)+kb2*(zt2-1./14)
are computed in scalar code, then combined into a V2DFmode vector and the
a1->qyz -= temp1;
a1->qzz -= temp2;
a2->qyz -= temp1;
a2->qzz -= temp2;
is already performed using vectorized code.  We'd need to carefully analyze the
costs to see whether putting the scalars into the vector is beneficial, but
presumably it is if the score shows that.
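
As a rough illustration of the shape described above (not the code GCC
actually emits), here is a minimal sketch using GCC's generic vector
extensions.  The struct atom layout, the v2df typedef and the update_pair
helper are hypothetical; the only assumption that matters is that qyz and qzz
are adjacent doubles.

```c
#include <string.h>

/* Two doubles in one 16-byte vector, i.e. what V2DFmode holds.  */
typedef double v2df __attribute__ ((vector_size (16)));

/* Hypothetical stand-in for the real structure: only the adjacency of
   qyz and qzz is assumed here.  */
struct atom { double qyz, qzz; };

_Static_assert (sizeof (struct atom) == sizeof (v2df),
                "qyz and qzz must be contiguous");

/* temp1 and temp2 are computed by scalar code in the caller, then
   combined into one vector that is subtracted from both atoms.  */
static void
update_pair (struct atom *a1, struct atom *a2, double temp1, double temp2)
{
  v2df t = { temp1, temp2 };   /* scalars packed into a vector */
  v2df v1, v2;

  /* memcpy is used as an unaligned vector load/store.  */
  memcpy (&v1, a1, sizeof v1);
  memcpy (&v2, a2, sizeof v2);
  v1 -= t;                     /* a1->qyz -= temp1; a1->qzz -= temp2; */
  v2 -= t;                     /* a2->qyz -= temp1; a2->qzz -= temp2; */
  memcpy (a1, &v1, sizeof v1);
  memcpy (a2, &v2, sizeof v2);
}
```

The cost question is whether building t from two scalars is cheaper than the
two scalar subtractions per atom it replaces.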

Or the:
                      xt = (*vector)[j] * r0;
                      yt = (*vector)[j + 1] * r0;
                      zt = (*vector)[j + 2] * r0;
                      a2->dpx -= k1 * xt;
                      a1->dpx += k1 * xt;
                      a2->dpy -= k1 * yt;
                      a1->dpy += k1 * yt;
                      a2->dpz -= k1 * zt;
                      a1->dpz += k1 * zt;
part shows that even though this would ideally be vectorized with V3DFmode
vectors, it can be vectorized using V2DFmode + scalar for the *z* elements.
Or say for a group of 6 we could consider vectorizing with a 4-unit vector and
a 2-unit vector for the remainder (perhaps split apart the SLP instance for
that, analyze each individually?).
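
A minimal sketch of the V2DFmode + scalar split for the group of 3, again
written with GCC's generic vector extensions and not taken from any actual GCC
output.  The struct atom3 layout, the v2df typedef and the update_dipole
helper are hypothetical, and v is assumed to point at (*vector)[j].

```c
#include <string.h>

/* Two doubles in one 16-byte vector, i.e. what V2DFmode holds.  */
typedef double v2df __attribute__ ((vector_size (16)));

/* Hypothetical layout: dpx, dpy and dpz are assumed to be contiguous.  */
struct atom3 { double dpx, dpy, dpz; };

static void
update_dipole (struct atom3 *a1, struct atom3 *a2,
               const double *v, double r0, double k1)
{
  v2df r = { r0, r0 };
  v2df k = { k1, k1 };
  v2df xy, d1, d2;

  /* Vector part: the x and y lanes.  memcpy acts as an unaligned load.  */
  memcpy (&xy, v, sizeof xy);
  v2df t = k * (xy * r);       /* { k1 * xt, k1 * yt } */

  memcpy (&d1, a1, sizeof d1); /* { a1->dpx, a1->dpy } */
  memcpy (&d2, a2, sizeof d2); /* { a2->dpx, a2->dpy } */
  d2 -= t;
  d1 += t;
  memcpy (a1, &d1, sizeof d1);
  memcpy (a2, &d2, sizeof d2);

  /* Scalar remainder: the z element.  */
  double zt = v[2] * r0;
  a2->dpz -= k1 * zt;
  a1->dpz += k1 * zt;
}
```

A group of 6 could be split the same way, with a 4-element vector for the
first part and a 2-element vector in place of the scalar remainder.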
