https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115304
--- Comment #10 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 3 Jun 2024, ams at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115304
>
> --- Comment #9 from Andrew Stubbs <ams at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #6)
> > The best strategy for GCN would be to gather V4QImode aka SImode into the
> > V64QImode (or V16SImode) vector.  For pix2 we have a gap of 28 elements,
> > doing consecutive loads isn't a good strategy here.
>
> I don't fully understand what you're trying to say here, so apologies if you
> knew all this already and I missed the point.....
>
> In general, on GCN V4QImode is not in any way equivalent to SImode (when the
> values are in registers).  The vector registers are not one single string of
> re-interpretable bits.
>
> For the same reason, you can't load a value as V64QImode and then try to
> interpret it as V16SImode.  GCN vector registers just don't work like
> SSE/Neon/etc.
>
> When you load a V64QImode vector, each lane is extended to 32 bits, so what
> you actually get in hardware is a V64SImode vector.
>
> Likewise, when you load a V4QImode vector the hardware representation is
> actually V4SImode (which in itself is just V64SImode with undefined values in
> the unused lanes).

I see.  I wonder if there's not one or two latent wrong-code bugs because
of this and the vectorizer's assumptions ;)  I suppose modes_tieable_p
will tell us whether a VIEW_CONVERT_EXPR will do the right thing?
Is GET_MODE_SIZE (V64QImode) == GET_MODE_SIZE (V64SImode) btw?
And is V64QImode really V64PSImode?

Still, for a V64QImode load on { c[0], c[1], c[2], c[3], c[32], c[33],
c[34], c[35], ... } it's probably best to use a single V64QImode gather
with GCN rather than four "consecutive" V64QImode loads followed by
element swizzling.
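
To make the access pattern concrete, here is a minimal scalar sketch in C of what a single 64-lane gather over { c[0], c[1], c[2], c[3], c[32], ... } would fetch.  It is only a model, not GCC or GCN code: the names gather_offset and gather_v64qi are made up for illustration, the stride of 32 is read off the index list above (4 used elements plus the gap of 28), and zero-extension of each byte into its 32-bit lane is an assumption, since the comment only says each lane is "extended".

```c
#include <stddef.h>
#include <stdint.h>

#define LANES  64  /* lanes in a V64QImode vector */
#define GROUP  4   /* contiguous elements per group, as in c[0]..c[3] */
#define STRIDE 32  /* distance between group starts: 4 used + 28 gap */

/* Hypothetical helper: element offset read by a given lane,
   i.e. 0, 1, 2, 3, 32, 33, 34, 35, 64, ...  */
static size_t gather_offset(unsigned lane)
{
    return (size_t)(lane / GROUP) * STRIDE + lane % GROUP;
}

/* Hypothetical model of one gather: every lane fetches its own byte
   and holds it in a 32-bit lane.  Zero-extension is assumed here.  */
static void gather_v64qi(const uint8_t *c, uint32_t out[LANES])
{
    for (unsigned lane = 0; lane < LANES; lane++)
        out[lane] = c[gather_offset(lane)];
}
```

In this model the result occupies 64 separate 32-bit lanes, which is why reinterpreting it as 16 SImode values would not give the packed bytes one might expect from SSE/Neon-style registers.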