On 14/06/2021 11:57, Richard Biener wrote:
On Mon, 14 Jun 2021, Richard Biener wrote:

Indeed. For example a simple

    int a[1024], b[1024], c[1024];

    void foo(int n)
    {
      for (int i = 0; i < n; ++i)
        a[i+1] += c[i+i] ? b[i+1] : 0;
    }

should usually see peeling for alignment (though on x86 you need exotic -march= since cost models generally have equal aligned and unaligned access costs). For example with -mavx2 -mtune=atom we'll see an alignment peeling prologue, an AVX2 vector loop, an SSE2 vectorized epilogue and a scalar epilogue. It also shows the original scalar loop being used in the scalar prologue and epilogue.

We're not even trying to make the counting IV easily used across loops (we're not counting scalar iterations in the vector loops).

Specifically we see

    <bb 33> [local count: 94607391]:
    niters_vector_mult_vf.10_62 = bnd.9_61 << 3;
    _67 = niters_vector_mult_vf.10_62 + 7;
    _64 = (int) niters_vector_mult_vf.10_62;
    tmp.11_63 = i_43 + _64;
    if (niters.8_45 == niters_vector_mult_vf.10_62)
      goto <bb 37>; [12.50%]
    else
      goto <bb 36>; [87.50%]

after the main vect loop, recomputing the original IV (i) rather than using the inserted canonical IV. And then the vectorized epilogue header check doing

    <bb 36> [local count: 93293400]:
    # i_59 = PHI <tmp.11_63(33), 0(18)>
    # _66 = PHI <_67(33), 0(18)>
    _96 = (unsigned int) n_10(D);
    niters.26_95 = _96 - _66;
    _108 = (unsigned int) n_10(D);
    _109 = _108 - _66;
    _110 = _109 + 4294967295;
    if (_110 <= 3)
      goto <bb 47>; [10.00%]
    else
      goto <bb 40>; [90.00%]

re-computing everything from scratch again (also notice how the main vect loop guard jumps around the alignment prologue as well and lands here - and the vectorized epilogue using unaligned accesses - good!).

That is, I'd expect _much_ easier jobs if we'd manage to track the number of performed scalar iterations (or the number of scalar iterations remaining) using the canonical IV we add to all loops across all of the involved loops.

Richard.
So I am now looking at using an IV that counts scalar iterations rather than vector iterations, and at reusing that IV through all loops (prologue, main loop, vect_epilogue and scalar epilogue). The former is easy, since that's what we already do for partial vectors or non-constant VFs. The latter requires some plumbing and removing a lot of the code in there that creates new IVs going from [0, niters - previous iterations].

I don't yet have a clear-cut view of how to do this. I first thought of keeping track of the 'control' IV in the loop_vinfo, but the prologue and scalar epilogues won't have one. 'loop' keeps a control_ivs struct, but that is used for overflow detection and only keeps track of what looks like a constant 'base' and 'step'. I'm not quite sure how all that works, but intuitively it doesn't seem like the right thing to reuse.
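
As a rough illustration of the idea, here is a hand-written C model of the four loops sharing one counter of scalar iterations. It is only a sketch and not what the vectorizer emits: the names foo_shared_iv, body, done, peel, VF_MAIN and VF_EPI are made up for this example, the arrays are passed as pointers so the caller chooses their sizes, and the vector loops are emulated by inner per-lane loops. The VFs of 8 and 4 are assumed from the AVX2/SSE2 dump above.

```c
#include <assert.h>

#define VF_MAIN 8u  /* assumed: ints per AVX2 vector */
#define VF_EPI  4u  /* assumed: ints per SSE2 vector */

/* One scalar iteration of the loop body from the example. */
static void body(int *a, const int *b, const int *c, unsigned i)
{
  a[i + 1] += c[i + i] ? b[i + 1] : 0;
}

/* Reference: the original scalar loop. */
void foo_scalar(int *a, const int *b, const int *c, unsigned n)
{
  for (unsigned i = 0; i < n; ++i)
    body(a, b, c, i);
}

/* Hypothetical model: 'done' counts scalar iterations performed so far
   and is the only IV carried from one loop into the next.  */
unsigned foo_shared_iv(int *a, const int *b, const int *c,
                       unsigned n, unsigned peel)
{
  unsigned done = 0;

  /* Scalar prologue (stands in for alignment peeling).  */
  if (peel > n)
    peel = n;
  for (; done < peel; ++done)
    body(a, b, c, done);

  /* Main vector loop: the shared counter steps by VF_MAIN.  */
  for (; n - done >= VF_MAIN; done += VF_MAIN)
    for (unsigned lane = 0; lane < VF_MAIN; ++lane)
      body(a, b, c, done + lane);

  /* Vectorized epilogue: same counter, smaller step.  */
  for (; n - done >= VF_EPI; done += VF_EPI)
    for (unsigned lane = 0; lane < VF_EPI; ++lane)
      body(a, b, c, done + lane);

  /* Scalar epilogue: whatever is left.  */
  for (; done < n; ++done)
    body(a, b, c, done);

  assert(done == n);  /* every scalar iteration ran exactly once */
  return done;
}
```

Each guard here is just a comparison of n - done against the next loop's VF, so nothing like niters or tmp above has to be rebuilt between loops; whether that maps cleanly onto the existing IV creation code is exactly the open question.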
I'll go hack around and keep you posted on progress.

Regards,
Andre
