https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021

--- Comment #24 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 27 Aug 2015, wschmidt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021
> 
> --- Comment #22 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #21)
> > (In reply to Bill Schmidt from comment #20)
> 
> ...<snip>...
> > 
> > I see it only failing due to cost issues (tried ppc64le and -mcpu=power8).
> > The unaligned loads cost 3 and we end up with
> > 
> > t.f90:8:0: note: Cost model analysis:
> >   Vector inside of loop cost: 40
> >   Vector prologue cost: 8
> >   Vector epilogue cost: 4
> >   Scalar iteration cost: 12
> >   Scalar outside cost: 6
> >   Vector outside cost: 12
> >   prologue iterations: 0
> >   epilogue iterations: 0
> > t.f90:8:0: note: cost model: the vector iteration cost = 40 divided by the
> > scalar iteration cost = 12 is greater or equal to the vectorization factor =
> > 1.
> > 
> > Note that we are (still) not very good in estimating the SLP cost as we
> > account 4 vector loads here (because we essentially will end up with
> > 4 different permutations used), so the "unaligned" part is accounted for
> > too much and likely the permutation cost as well.  Both are a limitation
> > of the SLP data structures and not easily fixable.  With
> > -fvect-cost-model=unlimited I see both loops vectorized.
> 
> Yes, I get these same results for the loop vectorizer (using -O2
> -ftree-vectorize -mcpu=power8 -ffast-math).  But I was looking at the failure
> to do SLP vectorization.  In comment 19 you indicated this was now working,
> presumably on x86, but for Power we fail to SLP-vectorize
> fast-math-pr37021.f90:9:0.

Err, I meant loop SLP vectorization as opposed to loop vectorization
with interleaving...  Basic-block SLP doesn't work because (at least)
it does not handle reductions yet (I have done some early work here
but wasn't able to finish it)

Reply via email to