https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441
--- Comment #9 from Tamar Christina <tnfchris at gcc dot gnu.org> ---

So on SVE the change is cost modelling. The bisect landed on
g:33c2b70dbabc02788caabcbc66b7baeafeb95bcf, which changed the compiler's
defaults to the new throughput-matched cost modelling used by newer cores.

It looks like this changes which mode the compiler picks when using a fixed
register size. This is because the new cost model (correctly) models the
costs for FMAs and promotions.

Before:

array1[0][_1] 1 times scalar_load costs 1 in prologue
(int) _2 1 times scalar_stmt costs 1 in prologue

After:

array1[0][_1] 1 times scalar_load costs 1 in prologue
(int) _2 1 times scalar_stmt costs 0 in prologue

and the cost goes from:

Vector inside of loop cost: 125

to

Vector inside of loop cost: 83

So far nothing sticks out, and in fact the profitability threshold for VNx4QI
drops from

Calculated minimum iters for profitability: 5

to

Calculated minimum iters for profitability: 3

This causes a clash: VNx4QI is now exactly the same cost as VNx2QI, which is
what it used to prefer before, and that leads it to pick the higher VF.

In the end the smaller VF shows:

;; Guessed iterations of loop 4 is 0.500488. New upper bound 1.

and now we get:

Vectorization factor 16 seems too large for profile prevoiusly believed to be
consistent; reducing.
;; Guessed iterations of loop 4 is 0.500488. New upper bound 0.
;; Scaling loop 4 with scale 66.6% (guessed) to reach upper bound 0

which I guess is the big difference.

There is some weird costing going on in the PHI nodes though:

m_108 = PHI <m_92(16), m_111(5)> 1 times vector_stmt costs 0 in body
m_108 = PHI <m_92(16), m_111(5)> 2 times scalar_to_vec costs 0 in prologue

they have collapsed to 0, which can't be right.
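For reference, the "minimum iters for profitability" numbers quoted above come
out of a break-even computation roughly like the sketch below. This is not
GCC's exact code (vect_estimate_min_profitable_iters also accounts for
prologue/epilogue peeling and other adjustments) and all the costs in main()
are invented; it only illustrates why a lower vector inside-of-loop cost
lowers the profitability threshold and so makes a candidate mode look cheaper.

#include <stdio.h>

/* Rough sketch of the vectorizer's break-even computation: the vector
   loop pays off once the saving per vf scalar iterations covers the
   extra outside-of-loop (prologue/epilogue) cost.  The real logic is
   in vect_estimate_min_profitable_iters and also handles peeling;
   this is a simplified illustration only.  */

static int
min_profitable_iters (int scalar_iter_cost, int vec_inside_cost,
                      int vec_outside_cost, int scalar_outside_cost,
                      int vf)
{
  /* Saving of one vector iteration over vf scalar iterations.  */
  int saving = scalar_iter_cost * vf - vec_inside_cost;
  /* Extra one-off cost of the vector version outside the loop.  */
  int overhead = vec_outside_cost - scalar_outside_cost;
  if (saving <= 0)
    return -1;  /* vector loop never wins */
  /* Round up: need at least this many scalar iterations.  */
  return (overhead * vf + saving - 1) / saving;
}

int
main (void)
{
  /* Invented costs: dropping the vector inside-of-loop cost from 125
     to 83 lowers the computed threshold, the same direction as the
     5 -> 3 drop seen in the dump above.  */
  printf ("old threshold: %d\n", min_profitable_iters (10, 125, 30, 10, 16));
  printf ("new threshold: %d\n", min_profitable_iters (10, 83, 30, 10, 16));
  return 0;
}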