https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037
--- Comment #21 from Richard Biener <rguenth at gcc dot gnu.org> ---
So after r257453 we improve the situation pre-IVOPTs to just 6 IVs
(duplicated but trivially equivalent) plus one counting IV.

But then, when SLP is enabled, IVOPTs comes along and adds another 4 IVs,
which makes us spill... (for AVX256, so you need -march=core-avx2 for example).

Bin, any chance you can take a look?  In the IVO dump I see

  target_avail_regs 15
  target_clobbered_regs 9
  target_reg_cost 4
  target_spill_cost 8
  regs_used 3
  ^^^

and regs_used looks awfully low to me.  The loop has even more IVs initially,
plus variable steps for those IVs, which means we need two regs per IV.

There doesn't seem to be a way to force IVOPTs to use the minimal set of IVs?
Or just use the original set, removing the obvious redundancies?

There is a microarchitectural issue left with the vectorization, but the
spilling obscures the picture quite a bit :/
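
To illustrate the "two regs per IV" point, here is a minimal sketch. It is
a hypothetical kernel, not the testcase from this PR; the function name
strided_dot3 and its parameters are made up for illustration. Each pointer
IV advances by a stride known only at run time, so besides the IV value
itself the step has to stay live in a register across the loop.

```c
#include <stddef.h>

/* Hypothetical example (not the PR testcase): three pointer IVs plus
   one counting IV.  The strides sa, sb and sc are loop-invariant but
   not compile-time constants, so each pointer IV keeps two values
   live in the loop: the current pointer and its step.  */
double
strided_dot3 (const double *a, ptrdiff_t sa,
              const double *b, ptrdiff_t sb,
              const double *c, ptrdiff_t sc,
              size_t n)
{
  double sum = 0.0;
  for (size_t i = 0; i < n; i++)   /* counting IV */
    {
      sum += *a * *b * *c;
      a += sa;   /* IV with variable step: value + step */
      b += sb;
      c += sc;
    }
  return sum;
}
```

With constant strides the steps could be folded into immediates or
addressing modes; with variable steps they compete with the IVs for
registers, which is why a low regs_used estimate matters here.  The cost
lines quoted above should show up when compiling with
-fdump-tree-ivopts-details.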