Hi all,
Up until now the AMD GCN port has been using exclusively 64-lane vectors
with masking for smaller sizes.
This works quite well, where it works, but there remain many test cases
(and no doubt some real code) that refuse to vectorize because the
number of iterations (or the SLP equivalent) is smaller than the
vectorization factor.
My question is: are there any plans to fill in these missing cases? Or,
is relying on masking alone just not feasible?
I've dabbled in the vectorizer code, of course, but I can't claim to
have much of a feel for it as a whole. I may be able to help with the
effort in future, but for now I'm struggling to judge what's even needed.
For GCN the vectorization is quite important as scalar code is slow, and
adding vectorization is usually cheap. The architecture can do any
vector size between 1 and 64 lanes (not just powers of two), so being
smaller than the vectorization factor really ought not to be a problem.
To fix this, I've been considering adding extra vector sizes (probably
2, 4, 8, 16, 32) where the backend would take care of the masking.
Aside from reductions and permutations the changes would be somewhat
trivial, but the explosion in the number of generated patterns would be
enormous, and it still wouldn't allow arbitrary-size vectors.
Thank you for your time; I'm trying to decide where my efforts should lie.
Andrew