Andrew Stubbs <a...@codesourcery.com> writes:
> Hi all,
>
> Up until now the AMD GCN port has been using exclusively 64-lane vectors
> with masking for smaller sizes.
>
> This works quite well, where it works, but there remain many test cases
> (and no doubt some real code) that refuse to vectorize because the
> number of iterations (or SLP equivalent) is smaller than the
> vectorization factor.
>
> My question is: are there any plans to fill in these missing cases? Or,
> is relying on masking alone just not feasible?

This is supported for loop vectorisation.  E.g.:

  void
  f (short *x)
  {
    for (int i = 0; i < 7; ++i)
      x[i] += 1;
  }

generates:

        ptrue   p0.h, vl7
        ld1h    z0.h, p0/z, [x0]
        add     z0.h, z0.h, #1
        st1h    z0.h, p0, [x0]
        ret

for SVE.  BB SLP is on the wish-list for GCC 11, but no promises. :-)

Early peeling/complete unrolling can cause loops to be straight-line
code by the time the vectoriser sees them.  E.g. the loop above doesn't
use masked SVE for "i < 3".

Which kind of cases fail for GCN?

Thanks,
Richard

>
> I've dabbled in the vectorizer code, of course, but I can't claim to
> have much of a feel for it as a whole. I may be able to help with the
> effort in future, but for now I'm struggling to judge what's even needed.
>
> For GCN the vectorization is quite important as scalar code is slow, and
> adding vectorization is usually cheap. The architecture can do any
> vector size between 1 and 64 lanes (not just powers of two), so being
> smaller than the vectorization factor really ought not be a problem.
>
> To fix this, I've been considering adding extra vector sizes (probably
> 2, 4, 8, 16, 32) where the backend would take care of the masking.
> Aside from reductions and permutations the changes would be somewhat
> trivial, but the explosion in the number of generated patterns would be
> enormous, and it still won't allow arbitrary size vectors.
>
> Thank you for your time; I'm trying to decide where my efforts should lie.
>
> Andrew
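For reference, here is a rough scalar sketch of what the masked sequence for the seven-element loop earlier in this message computes. It is only an illustration: the helper name masked_add_one and the LANES constant are hypothetical, the 64-lane width is borrowed from the GCN description, and nothing here is vectoriser or backend code. It just shows that one full-width operation with a lane mask covers any trip count up to the vector width.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 64  /* assumed hardware vector width (hypothetical constant) */

/* Hypothetical helper: emulate one masked vector "add 1" covering at most
   LANES elements.  Lane i is active only when i < n, which is the set of
   lanes a predicate such as "ptrue p0.h, vl7" selects for n == 7.  */
void masked_add_one(short *x, size_t n)
{
    assert(n <= LANES);
    for (size_t lane = 0; lane < LANES; ++lane) {
        int active = lane < n;   /* per-lane mask bit */
        if (active)
            x[lane] += 1;        /* inactive lanes do not touch memory */
    }
}
```

Calling masked_add_one (x, 7) updates x[0] through x[6] and leaves everything after that alone, which is the behaviour the predicated load, add and store give in a single pass.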