On Thu, 15 Feb 2024, Andrew Stubbs wrote:

> On 15/02/2024 10:21, Richard Biener wrote:
> [snip]
> >>> I suppose if RDNA really only has 32 lane vectors (it sounds like it,
> >>> even if it can "simulate" 64 lane ones?) then it might make sense to
> >>> vectorize for 32 lanes?  That said, with variable-length it likely
> >>> doesn't matter but I'd not expose fixed-size modes with 64 lanes then?
> >>
> >> For most operations, wavefrontsize=64 works just fine; the GPU runs each
> >> instruction twice and presents a pair of hardware registers as a logical
> >> 64-lane register.  This breaks down for permutations and reductions, and
> >> is obviously inefficient when the vectors are not fully utilized, but is
> >> otherwise compatible with the GCN/CDNA compiler.
> >>
> >> I didn't want to invest all the effort it would take to support
> >> wavefrontsize=32, which would be the natural mode for these devices; the
> >> number of places that have "64" hard-coded is just too big.  Not only
> >> that, but the EXEC and VCC registers change from DImode to SImode and
> >> that's going to break a lot of stuff.  (And we have no paying customer
> >> for this.)
> >>
> >> I'm open to patch submissions. :)
> >
> > OK, I see ;)  As said for fully masked that's a good answer.  I'd
> > probably still not expose V64mode modes in the RTL expanders for the
> > vect_* patterns?  Or, what happens if you change
> > gcn_vectorize_preferred_simd_mode to return 32 lane modes for RDNA
> > and omit 64 lane modes from gcn_autovectorize_vector_modes for RDNA?
>
> Changing the preferred mode probably would fix permute.
>
> > Does that possibly leave performance on the plate?  (not sure if there's
> > any documents about choosing wavefrontsize=64 vs 32 with regard to
> > performance)
> >
> > Note it wouldn't entirely forbid the vectorizer from using larger modes,
> > it just makes it prefer the smaller ones.  OTOH if you then run
> > wavefrontsize=64 on top of it it's probably wasting the 2nd instruction
> > by always masking it?
>
> Right, the GPU will continue to process the "top half" of the vector as an
> additional step, regardless whether you put anything useful there, or not.
>
> > So yeah.  Guess a s/64/wavefrontsize/ would be a first step towards
> > allowing 32 there ...
>
> I think the DImode to SImode change is the most difficult fix.  Unless you
> know of a cunning trick, that's going to mean a lot of changes to a lot of
> the machine description; substitutions, duplications, iterators,
> indirections, etc., etc., etc.

Hmm, maybe just leave it at DImode in the patterns?  OTOH mode iterators
to do both SImode and DImode might work as well, but yeah, a lot of churn.

Richard.