Hi Steve, On Mon, 6 Mar 2017, Steve Ellcey wrote:
> I was looking at the spec 456.hmmer benchmark and this email string > from Jeff Law and Micheal Matz: > > https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01970.html > > and was wondering if anyone was looking at what more it would take > for GCC to vectorize the loop in P7Viterbi. It takes what I wrote in there. There are two important things that need to happen to get the best performance (at least from an analysis I did in 2011, but nothing material should have changed since then): (1) loop distribution to make some memory streams vectorizable (and leave the others in non-vectorized form). (1a) loop splitting based on conditional (to remove the k<M conditional) (2) a predictive commoning (or loop carried store reuse) on the dc[] stream Non of these is valid if the loop streams can't be disambiguated, and as this is C only adding explicit restrict qualifiers would give you that, or runtime disambiguation, like ICC is doing, that's part (0). Part (1a) is implemented by gimple loop splitting, but it's not effective on hmmer because of missing part (0). Even then performance is not on par with ICC as long as parts (1) and (2) are missing as well (measured by splitting the loops by hand and adding restrict qualifiers explicitely). The inner loop split by streams and conditinional looks like so: for (k = 1; k < M; k++) { mc[k] = mpp[k-1] + tpmm[k-1]; if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc; if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc; if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc; mc[k] += ms[k]; if (mc[k] < -INFTY) mc[k] = -INFTY; } for (k = 1; k < M; k++) { dc[k] = dc[k-1] + tpdd[k-1]; if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc; if (dc[k] < -INFTY) dc[k] = -INFTY; } for (k = 1; k < M; k++) { ic[k] = mpp[k] + tpmi[k]; if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc; ic[k] += is[k]; if (ic[k] < -INFTY) ic[k] = -INFTY; } /* last iteration of original loop */ k = M; mc[k] = mpp[k-1] + tpmm[k-1]; if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc; if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc; if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc; mc[k] += ms[k]; if (mc[k] < -INFTY) mc[k] = -INFTY; dc[k] = dc[k-1] + tpdd[k-1]; if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc; if (dc[k] < -INFTY) dc[k] = -INFTY; (Note again, that this is only valid with disambiguation). Adding restrict qualifiers at the top of routine like so: #define R __restrict int * R mc, * R dc, * R ic; /* pointers to rows of mmx, dmx, imx */ int * R ms, * R is; /* pointers to msc[i], isc[i] */ int * R mpp, * R mpc, * R ip; /* ptrs to mmx[i-1], mmx[i], imx[i-1] */ int * R bp; /* ptr into bsc[] */ int * R ep; /* ptr into esc[] */ int * R dpp; /* ptr into dmx[i-1] (previous row) */ int * R tpmm, * R tpmi, * R tpmd, * R tpim, * R tpii, * R tpdm, * R tpdd; /* ptrs into tsc */ helps to vectorize this. To get the final rest of performance also this transformation needs to happen on the dc[] loop: dctemp=dc[0]; for (k = 1; k < M; k++) { dctemp = dctemp + tpdd[k-1]; if ((sc = mc[k-1] + tpmd[k-1]) > dctemp) dctemp = sc; if (dctemp < -INFTY) dctemp = -INFTY; dc[k] = dctemp; } Our loop distribution should actually already be able to split off the three memory streams when restrict is added everywhere, at the 2011 time frame it didn't do it nevertheless (and I haven't looked if it would be able to do that now). predictive commoning could do the dc[] transformation (part (2)), except that it can't without disambiguation. That adding restrict doesn't help here is PR50419, but ultimately it would have to work on the disambiguated loop (without the restrict pointer). So really the prerequisite to optimize hmmer is loop disambiguation, even with the many streams (and hence conditionals) that are there. And it needs to happen well before the loop vectorizer, because loop splitting and distribution, _and_ predictive commoning have the disambiguation as prerequisite in this testcase. After that loop distribution needs to be looked at why it doesn't want to distribute the streams, and then a variant of PR50419 needs to be fixed based on disambiguation info (not based on restrict). For that we need infrastructure that would enable us to disambiguate mem accesses after loop nest versioning happened in the "good" version. Ciao, Michael.