https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021
--- Comment #24 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 27 Aug 2015, wschmidt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021
>
> --- Comment #22 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #21)
> > (In reply to Bill Schmidt from comment #20)
> > ...<snip>...
> >
> > I see it only failing due to cost issues (tried ppc64le and -mcpu=power8).
> > The unaligned loads cost 3 and we end up with
> >
> > t.f90:8:0: note: Cost model analysis:
> >   Vector inside of loop cost: 40
> >   Vector prologue cost: 8
> >   Vector epilogue cost: 4
> >   Scalar iteration cost: 12
> >   Scalar outside cost: 6
> >   Vector outside cost: 12
> >   prologue iterations: 0
> >   epilogue iterations: 0
> > t.f90:8:0: note: cost model: the vector iteration cost = 40 divided by the
> > scalar iteration cost = 12 is greater or equal to the vectorization factor =
> > 1.
> >
> > Note that we are (still) not very good in estimating the SLP cost as we
> > account 4 vector loads here (because we essentially will end up with
> > 4 different permutations used), so the "unaligned" part is accounted for
> > too much and likely the permutation cost as well.  Both are a limitation
> > of the SLP data structures and not easily fixable.  With
> > -fvect-cost-model=unlimited I see both loops vectorized.
>
> Yes, I get these same results for the loop vectorizer (using -O2
> -ftree-vectorize -mcpu=power8 -ffast-math).  But I was looking at the failure
> to do SLP vectorization.  In comment 19 you indicated this was now working,
> presumably on x86, but for Power we fail to SLP-vectorize
> fast-math-pr37021.f90:9:0.

Err, I meant loop SLP vectorization as opposed to loop vectorization
with interleaving...  Basic-block SLP doesn't work because (at least)
it does not handle reductions yet (I have done some early work here
but wasn't able to finish it).
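
For readers following the numbers, the rejection message in the quoted dump
("vector iteration cost = 40 divided by the scalar iteration cost = 12 is
greater or equal to the vectorization factor = 1") boils down to a single
comparison. The sketch below only illustrates that arithmetic. It is not
GCC's actual code: the function name vector_body_is_profitable and its
parameter names are made up for this example, and the real cost model also
weighs the prologue, epilogue and outside costs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper (not a GCC function): returns true when one
   iteration of the vector loop body is judged cheaper than the `vf`
   scalar iterations it replaces.  */
bool vector_body_is_profitable(int vec_inside_cost,
                               int scalar_iter_cost,
                               int vf)
{
    assert(scalar_iter_cost > 0 && vf > 0);

    /* The dump rejects the loop when
         vec_inside_cost / scalar_iter_cost >= vf.
       Cross-multiplying keeps the test in integer arithmetic, so the
       body is profitable only when the inequality below holds.  */
    return vec_inside_cost < scalar_iter_cost * vf;
}
```

With the values from the dump (40, 12, 1) the helper returns false, since
40 is not below 12 * 1. As a purely hypothetical contrast, a vectorization
factor of 4 with the same costs would pass, since 40 < 48.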