Hi all,

Up until now the AMD GCN port has been using exclusively 64-lane vectors with masking for smaller sizes.

This works quite well, where it works, but there remain many test cases (and no doubt some real code) that refuse to vectorize because the number of iterations (or the SLP equivalent) is smaller than the vectorization factor.
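For a concrete (hypothetical) example of the kind of loop I mean: with a 64-lane VF, a trip count of 4 makes the loop vectorizer give up entirely, even though GCN could handle it with a mask covering just 4 lanes.

```c
/* Illustrative only: a fixed trip count of 4 is far below a 64-lane
   vectorization factor, so the loop is left scalar. */
void add4(int *a, const int *b)
{
    for (int i = 0; i < 4; i++)
        a[i] += b[i];
}
```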

My question is: are there any plans to fill in these missing cases? Or, is relying on masking alone just not feasible?

I've dabbled in the vectorizer code, of course, but I can't claim to have much of a feel for it as a whole. I may be able to help with the effort in future, but for now I'm struggling to judge what's even needed.

For GCN, vectorization is quite important: scalar code is slow, and adding vectorization is usually cheap. The architecture can use any vector size between 1 and 64 lanes (not just powers of two), so a trip count smaller than the vectorization factor really ought not to be a problem.

To fix this, I've been considering adding extra vector sizes (probably 2, 4, 8, 16, and 32) for which the backend would take care of the masking. Aside from reductions and permutations the changes would be somewhat trivial, but the explosion in the number of generated patterns would be enormous, and it still wouldn't allow arbitrary-size vectors.
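To be clear about what I mean by the backend handling the masking, here is a rough scalar model (illustrative only; the function name and shape are my own invention): a single 64-lane operation covers any trip count n <= 64 by disabling the lanes past n via the execution mask, which is what GCN's EXEC register does in hardware.

```c
/* A scalar model of GCN-style masked execution: one 64-lane vector add
   covers a loop with trip count n <= 64 by masking off the excess lanes. */
enum { VF = 64 };

void masked_add(int *dst, const int *a, const int *b, int n)
{
    for (int lane = 0; lane < VF; lane++) {
        /* 'active' models this lane's bit in the EXEC mask */
        int active = lane < n;
        if (active)
            dst[lane] = a[lane] + b[lane];
    }
}
```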

Thank you for your time; I'm trying to decide where my efforts should lie.

Andrew
