https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #9 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
So on SVE the change is cost modelling.

Bisect landed on g:33c2b70dbabc02788caabcbc66b7baeafeb95bcf, which changed the
compiler's defaults to use the new throughput-matched cost modelling used by
newer cores.

It looks like this changes which mode the compiler picks when using a fixed
register size.

This is because the new cost model (correctly) models the costs for FMAs and
promotions.
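
For reference, the hot loop presumably has roughly the following shape (a
hypothetical reconstruction from the dump lines below, not the actual testcase
from this PR): a narrow load that is promoted and fed into a
multiply-accumulate, which the new model now costs differently.

/* Minimal sketch of the loop shape implied by the dump below: an
   unsigned char load promoted to int and fed into a multiply-accumulate.
   Hypothetical reconstruction only, not the testcase from this PR.  */
unsigned char array1[8][8], array2[8][8];

int
sum (void)
{
  int m = 0;
  for (int i = 0; i < 8; i++)
    for (int j = 0; j < 8; j++)
      m += (int) array1[i][j] * (int) array2[i][j];
  return m;
}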

Before:

array1[0][_1] 1 times scalar_load costs 1 in prologue
(int) _2 1 times scalar_stmt costs 1 in prologue

after:

array1[0][_1] 1 times scalar_load costs 1 in prologue 
(int) _2 1 times scalar_stmt costs 0 in prologue 

and the cost goes from:

Vector inside of loop cost: 125

to

Vector inside of loop cost: 83 

So far, nothing sticks out, and in fact the profitability threshold for VNx4QI drops from

Calculated minimum iters for profitability: 5

to

Calculated minimum iters for profitability: 3
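
Intuitively (this is a simplified break-even sketch, not GCC's actual
profitability computation), a lower inside-loop vector cost means the fixed
vectorization overhead is amortised in fewer iterations, so the threshold
drops:

/* Simplified break-even model, NOT GCC's exact formula: the vector loop
   pays off once the per-iteration saving over scalar code covers the
   fixed setup overhead.  Apart from the 125/83 inside-loop costs from
   the dump, all numbers are made up for illustration.  */
#include <stdio.h>

static int
min_profitable_iters (int scalar_iter_cost, int vec_inside_cost,
                      int vf, int fixed_overhead)
{
  double saving = scalar_iter_cost - (double) vec_inside_cost / vf;
  if (saving <= 0)
    return -1;  /* never profitable in this toy model */
  return (int) (fixed_overhead / saving) + 1;
}

int
main (void)
{
  printf ("old: %d\n", min_profitable_iters (10, 125, 16, 20));
  printf ("new: %d\n", min_profitable_iters (10, 83, 16, 20));
  return 0;
}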

This causes a clash, as VNx4QI now has exactly the same cost as VNx2QI, which
is what the compiler preferred before.

This then leads it to pick the higher VF.

In the end the smaller VF shows:

;; Guessed iterations of loop 4 is 0.500488. New upper bound 1.

and now we get:

Vectorization factor 16 seems too large for profile prevoiusly believed to be
consistent; reducing.  
;; Guessed iterations of loop 4 is 0.500488. New upper bound 0.
;; Scaling loop 4 with scale 66.6% (guessed) to reach upper bound 0

which I guess is the big difference.

There is some weird costing going on in the PHI nodes though:

m_108 = PHI <m_92(16), m_111(5)> 1 times vector_stmt costs 0 in body 
m_108 = PHI <m_92(16), m_111(5)> 2 times scalar_to_vec costs 0 in prologue

They have collapsed to 0, which can't be right..
