Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

Richard Biener Thu, 15 Feb 2024 04:31:36 -0800

On Thu, 15 Feb 2024, Andrew Stubbs wrote:

> On 15/02/2024 10:21, Richard Biener wrote:
> [snip]
> >>> I suppose if RDNA really only has 32 lane vectors (it sounds like it,
> >>> even if it can "simulate" 64 lane ones?) then it might make sense to
> >>> vectorize for 32 lanes?  That said, with variable-length it likely
> >>> doesn't matter but I'd not expose fixed-size modes with 64 lanes then?
> >>
> >> For most operations, wavefrontsize=64 works just fine; the GPU runs each
> >> instruction twice and presents a pair of hardware registers as a logical
> >> 64-lane register. This breaks down for permutations and reductions, and is
> >> obviously inefficient when the vectors are not fully utilized, but is
> >> otherwise compatible with the GCN/CDNA compiler.
> >>
> >> I didn't want to invest all the effort it would take to support
> >> wavefrontsize=32, which would be the natural mode for these devices; the
> >> number of places that have "64" hard-coded is just too big. Not only that,
> >> but the EXEC and VCC registers change from DImode to SImode and that's
> >> going to break a lot of stuff. (And we have no paying customer for this.)
> >>
> >> I'm open to patch submissions. :)
> > 
> > OK, I see ;)  As said for fully masked that's a good answer.  I'd
> > probably still not expose V64mode modes in the RTL expanders for the
> > vect_* patterns?  Or, what happens if you change
> > gcn_vectorize_preferred_simd_mode to return 32 lane modes for RDNA
> > and omit 64 lane modes from gcn_autovectorize_vector_modes for RDNA?
> 
> Changing the preferred mode probably would fix permute.
> 
> > Does that possibly leave performance on the table? (not sure if there's
> > any documents about choosing wavefrontsize=64 vs 32 with regard to
> > performance)
> > 
> > Note it wouldn't entirely forbid the vectorizer from using larger modes,
> > it just makes it prefer the smaller ones.  OTOH if you then run
> > wavefrontsize=64 on top of it, it's probably wasting the 2nd instruction
> > by always masking it?
> 
> Right, the GPU will continue to process the "top half" of the vector as an
> additional step, regardless whether you put anything useful there, or not.
> 
> > So yeah.  Guess a s/64/wavefrontsize/ would be a first step towards
> > allowing 32 there ...
> 
> I think the DImode to SImode change is the most difficult fix. Unless you know
> of a cunning trick, that's going to mean a lot of changes to a lot of the
> machine description; substitutions, duplications, iterators, indirections,
> etc., etc., etc.


Hmm, maybe just leave it at DImode in the patterns?  OTOH mode
iterators to do both SImode and DImode might work as well, but yeah,
a lot of churn.
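
To make the s/64/wavefrontsize/ idea a bit more concrete, here is a
minimal standalone C++ sketch.  It is not GCC code and does not reflect
the actual GCN backend: WaveConfig, kHwLanes, hw_passes, full_exec_mask
and utilization are hypothetical names made up for illustration, and the
32-lane hardware width is simply taken from the description earlier in
the thread.  The mask is kept in a 64-bit integer for both
configurations, loosely mirroring the "leave it at DImode" option.

```cpp
#include <cstdint>

// Hypothetical model: the lane count is a parameter rather than a
// hard-coded 64.
struct WaveConfig {
  unsigned lanes;  // logical wavefront size: 32 or 64
};

// Assumed native SIMD width of the hardware (32 lanes, per the thread).
constexpr unsigned kHwLanes = 32;

// How many times the hardware issues one logical vector instruction.
constexpr unsigned hw_passes(WaveConfig c) {
  return (c.lanes + kHwLanes - 1) / kHwLanes;
}

// All-lanes-enabled execution mask, always stored in 64 bits; only the
// low `lanes` bits are meaningful.  The ternary avoids shifting a
// 64-bit value by 64, which would be undefined behaviour.
constexpr std::uint64_t full_exec_mask(WaveConfig c) {
  return c.lanes >= 64 ? ~std::uint64_t{0}
                       : ((std::uint64_t{1} << c.lanes) - 1);
}

// Fraction of issued lanes doing useful work when `active` elements
// are live in the vector.
constexpr double utilization(WaveConfig c, unsigned active) {
  return static_cast<double>(active) /
         static_cast<double>(hw_passes(c) * kHwLanes);
}

constexpr WaveConfig wave32{32};
constexpr WaveConfig wave64{64};

// A 64-lane wavefront costs two passes on 32-lane hardware.
static_assert(hw_passes(wave32) == 1, "wave32 issues once");
static_assert(hw_passes(wave64) == 2, "wave64 issues twice");

// Filling only 32 lanes of a 64-lane wavefront wastes the second pass.
static_assert(utilization(wave64, 32) == 0.5, "half of wave64 is idle");
static_assert(utilization(wave32, 32) == 1.0, "wave32 is fully used");
```

In this toy model, preferring 32-lane modes while still running with
wavefrontsize=64 halves the utilization, which is the "wasting the 2nd
instruction" concern above; only parameterizing the lane count recovers
it.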

Richard.
