https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037
--- Comment #24 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to amker from comment #23)
> (In reply to Richard Biener from comment #21)
> > So after r257453 we improve the situation pre-IVOPTs to just 6 IVs
> > (duplicated but trivially equivalent) plus one counting IV.  But then,
> > when SLP is enabled, IVOPTs comes along and adds another 4 IVs which
> > makes us spill... (for AVX256, so you need -march=core-avx2 for example).
> >
> > Bin, any chance you can take a look?  In the IVO dump I see
> >
> > target_avail_regs 15
> > target_clobbered_regs 9
> > target_reg_cost 4
> > target_spill_cost 8
> > regs_used 3
> > ^^^
> >
> > and regs_used looks awfully low to me.  The loop has even more IVs
> > initially, plus variable steps for those IVs, which means we need two
> > regs per IV.
> >
> > There doesn't seem to be a way to force IVOPTs to use the minimal set
> > of IVs?  Or just use the original set, removing the obvious
> > redundancies?  There is a microarchitectural issue left with the
> > vectorization, but the spilling obscures the picture quite a bit :/
>
> Sure, I will have a look based on your commit.  Thanks

Note the loop in question is the one starting at line 551; it gets inlined
multiple times, but the issue is visible with -fno-inline as well.

-mavx2 makes things worse (compared to -mavx2 -mprefer-avx128) because for
the strided accesses we choose to compute extra invariants for the two
strides of A and E.  For SSE we keep stride and stride * 3, while for AVX we
additionally compute stride * 5, stride * 6 and stride * 7 (in the cases
where we don't choose another base IV).  At least computing stride * 6 can
be avoided by using stride * 3 with step 2 - but it's probably too hard to
see that within the current IVO model?

I'm not sure avoiding an invariant in exchange for an extra IV is ever a
good idea?  Spilling an invariant should be cheaper than spilling an IV -
but yes, the addressing mode can possibly absorb any bias we apply there as
an offset.

Note the vectorizer itself tries to avoid computing stride * N by
strength-reducing it:

  _711 = (sizetype) iftmp.472_91;
  _712 = _711 * 64;
  _715 = (sizetype) iftmp.472_91;
  _716 = _715 * 8;
  ...
  # ivtmp_891 = PHI <ivtmp_892(28), _710(44)>
  ...
  _893 = MEM[(real(kind=4) *)ivtmp_891];
  ivtmp_894 = ivtmp_891 + _716;
  _895 = MEM[(real(kind=4) *)ivtmp_894];
  ivtmp_896 = ivtmp_894 + _716;
  _897 = MEM[(real(kind=4) *)ivtmp_896];
  ivtmp_898 = ivtmp_896 + _716;
  _899 = MEM[(real(kind=4) *)ivtmp_898];
  ivtmp_900 = ivtmp_898 + _716;
  _901 = MEM[(real(kind=4) *)ivtmp_900];
  ivtmp_902 = ivtmp_900 + _716;
  _903 = MEM[(real(kind=4) *)ivtmp_902];
  ivtmp_904 = ivtmp_902 + _716;
  _905 = MEM[(real(kind=4) *)ivtmp_904];
  ivtmp_906 = ivtmp_904 + _716;
  _907 = MEM[(real(kind=4) *)ivtmp_906];
  vect_cst__909 = {_893, _895, _897, _899, _901, _903, _905, _907};
  ...
  ivtmp_892 = ivtmp_891 + _712;

Note how it advances the IV in one step at the end, though.  Not sure if IVO
is confused by that or by the way we compute _716 vs. _712.

That said, the summary is that IVO behavior with unrolled loop bodies with
variable stride isn't helping here ;)
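
For illustration only, a minimal C sketch of the "stride * 3 with step 2"
idea mentioned above (this is not the Fortran loop from this PR; the
function names, array and stride types are made up): on x86 a scaled-index
addressing mode can supply the factor of 2, so stride * 6 does not have to
be kept live as a separate invariant.

  #include <stddef.h>

  /* Each stride * k kept as its own invariant wants its own register.  */
  float gather8_invariants (const float *a, ptrdiff_t stride)
  {
    ptrdiff_t s3 = stride * 3;
    ptrdiff_t s5 = stride * 5;
    ptrdiff_t s6 = stride * 6;
    ptrdiff_t s7 = stride * 7;
    return a[0] + a[stride] + a[stride * 2] + a[s3]
         + a[stride * 4] + a[s5] + a[s6] + a[s7];
  }

  /* stride * 6 expressed as s3 * 2: the element at a + s3 * 2 is reachable
     with a scaled index (index s3, scale 8 for 4-byte floats), so only s3,
     s5 and s7 need to stay live across the loop.  */
  float gather8_reuse (const float *a, ptrdiff_t stride)
  {
    ptrdiff_t s3 = stride * 3;
    ptrdiff_t s5 = stride * 5;
    ptrdiff_t s7 = stride * 7;
    return a[0] + a[stride] + a[stride * 2] + a[s3]
         + a[stride * 4] + a[s5] + a[s3 * 2] + a[s7];
  }

Whether IVOPTs would actually pick the scaled form here depends on its cost
model; the sketch only shows that the extra invariant can be avoided without
introducing another IV.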