https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64716
--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Say on:
    a2->qyz -= (k2+ka2+kb2)*yt*zt;
    a1->qyz -= (k2+ka2+kb2)*yt*zt;
    a2->qzz -= k2*(zt2 - 1./3) + ka2*(zt2 - 1./8)+kb2*(zt2-1./14) ;
    a1->qzz -= k2*(zt2 - 1./3) + ka2*(zt2 - 1./8)+kb2*(zt2-1./14) ;
it seems that temp1 = (k2+ka2+kb2)*yt*zt and temp2 = k2*(zt2 - 1./3) + ka2*(zt2 - 1./8)+kb2*(zt2-1./14) are computed in scalar code, then combined into a V2DFmode vector, and the
    a1->qyz -= temp1; a1->qzz -= temp2;
    a2->qyz -= temp1; a2->qzz -= temp2;
is already performed using vectorized code. We'd need to carefully analyze the costs to see whether putting the scalars into the vector is beneficial, but presumably it is if the score shows that.

Or the:
    xt = (*vector)[j] * r0;
    yt = (*vector)[j + 1] * r0;
    zt = (*vector)[j + 2] * r0;
    a2->dpx -= k1 * xt;
    a1->dpx += k1 * xt;
    a2->dpy -= k1 * yt;
    a1->dpy += k1 * yt;
    a2->dpz -= k1 * zt;
    a1->dpz += k1 * zt;
part shows that even though this would ideally be vectorized with V3DFmode vectors, it can be vectorized using V2DFmode + scalar for the *z* elements. Or say for a group of 6 we could consider vectorizing with a 4-unit vector and a 2-unit vector for the remainder (perhaps split apart the SLP instance for that and analyze each part individually?).
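
For illustration only, the first case can be written out by hand roughly as follows. This is a minimal sketch, not the code GCC actually emits: the atom_q struct is a hypothetical cut-down record that assumes qyz and qzz are adjacent doubles, the function names are made up, and the v2df type relies on the GCC/Clang vector_size extension as a stand-in for V2DFmode.

```c
#include <assert.h>   /* static_assert */
#include <string.h>   /* memcpy */

/* Hypothetical cut-down atom record: only the two adjacent fields matter. */
typedef struct {
    double qyz;
    double qzz;
} atom_q;

/* Two-lane double vector (GCC/Clang extension), standing in for V2DFmode. */
typedef double v2df __attribute__((vector_size(16)));

static_assert(sizeof(atom_q) == sizeof(v2df), "qyz/qzz must fill one vector");

/* The four scalar statements as they appear in the source. */
void update_scalar(atom_q *a1, atom_q *a2, double k2, double ka2,
                   double kb2, double yt, double zt, double zt2)
{
    a2->qyz -= (k2 + ka2 + kb2) * yt * zt;
    a1->qyz -= (k2 + ka2 + kb2) * yt * zt;
    a2->qzz -= k2 * (zt2 - 1./3) + ka2 * (zt2 - 1./8) + kb2 * (zt2 - 1./14);
    a1->qzz -= k2 * (zt2 - 1./3) + ka2 * (zt2 - 1./8) + kb2 * (zt2 - 1./14);
}

/* Hand-written equivalent of the described transformation: the two
   right-hand sides are computed once in scalar code, packed into one
   vector, and each atom is then updated with a single vector subtract. */
void update_v2df(atom_q *a1, atom_q *a2, double k2, double ka2,
                 double kb2, double yt, double zt, double zt2)
{
    double temp1 = (k2 + ka2 + kb2) * yt * zt;
    double temp2 = k2 * (zt2 - 1./3) + ka2 * (zt2 - 1./8)
                 + kb2 * (zt2 - 1./14);
    v2df t = { temp1, temp2 };   /* scalars combined into a vector */
    v2df v1, v2;

    memcpy(&v1, a1, sizeof v1);  /* load {qyz, qzz} of a1 */
    memcpy(&v2, a2, sizeof v2);  /* load {qyz, qzz} of a2 */
    v1 -= t;                     /* a1->qyz -= temp1; a1->qzz -= temp2; */
    v2 -= t;                     /* a2->qyz -= temp1; a2->qzz -= temp2; */
    memcpy(a1, &v1, sizeof v1);  /* store back */
    memcpy(a2, &v2, sizeof v2);
}
```

The cost question raised in the comment is visible here: the vector form saves two subtractions and two stores, but pays for building t from two scalars, so whether it wins depends on the target.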